
GNAT compiler switches and optimization


tkrauss

Oct 20, 2006, 6:47:44 AM
I'm a bit stuck trying to figure out how to coax more performance
out of some Ada code. I suspect there is something simple (like
compiler switches) but I'm missing it. As an example I'm using
a simple matrix multiply and comparing it to similar code in
Fortran. Unfortunately the Ada code takes 3-4 times as long.

I'm using GNAT (GPL 2006) and GFortran (4.2.0) and the following
compile options:

gnat make -O3 -gnatp tst_array
gfortran -O3 tst_array.f95

Running them on 800x800 matrices (on my 2GHz laptop)

for Ada: "tst_array 800" runs in 18 seconds
for Fortran "tst_array 800" runs in 6 seconds

(if I use the fortran "matmul" intrinsic the fortran time drops to
2.5 seconds)

Note, I tried reordering the loops, removing the random calls, etc.,
none of which made huge changes. There is something killing performance
and/or a switch or two that I'm missing, but I can't seem to find it.
Any thoughts?


Here is the code:


-- tst_array.adb
with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;

   A, B, C       : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum           : Float := 0.0;
begin
   N     := Positive'Value (Argument (1));
   Start := Ada.Calendar.Clock;

   A := new Real_Matrix (1 .. N, 1 .. N);
   B := new Real_Matrix (1 .. N, 1 .. N);
   C := new Real_Matrix (1 .. N, 1 .. N);
   for I in A'Range (1) loop
      for J in A'Range (2) loop
         A (I, J) := Ada.Numerics.Float_Random.Random (G);
         B (I, J) := Ada.Numerics.Float_Random.Random (G);
      end loop;
   end loop;

   for I in A'Range (1) loop
      for J in A'Range (2) loop
         Sum := 0.0;
         for R in A'Range (2) loop
            Sum := Sum + A (I, R) * B (R, J);
         end loop;
         C (I, J) := Sum;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1)); F_IO.Put (C (1, 2)); New_Line;
   F_IO.Put (C (2, 1)); F_IO.Put (C (2, 2)); New_Line;

   Finish := Ada.Calendar.Clock;
   Put ("Time: "); D_IO.Put (Finish - Start); New_Line; New_Line;
end tst_array;

! tst_array.f95
program test_array
   integer, parameter :: seed = 86456
   integer :: N
   integer :: numArgs
   real, dimension(:,:), allocatable :: a, b, c
   real, dimension(:,:), allocatable :: d, e, f
   real :: sum
   character*100 buffer
   real :: begin, finish
   integer :: I, J, R

   call getarg(1, buffer)
   read(buffer,*) N

   call srand(seed)

   begin = second()

   allocate( a(N,N) )
   allocate( b(N,N) )
   allocate( c(N,N) )
   forall (I = 1:N, J = 1:N)
      a(i,j) = rand()
      b(i,j) = rand()
   end forall

   !c = matmul(a, b)
   do I = 1,N
      do J = 1,N
         sum = 0.0
         do R = 1,N
            sum = sum + A(I,R)*B(R,J)
         end do
         C(I,J) = sum
      end do
   end do

   print *, c(1,1), c(1,2), c(2,1), c(2,2)
   finish = second()
   print *, 'Time: ', (finish-begin)

end program test_array

Duncan Sands

Oct 20, 2006, 7:04:55 AM
to comp.l...@ada-france.org, tkrauss
On Friday 20 October 2006 12:47, tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.

GNAT GPL 2006 is based on gcc 3.4.6. For Fortran you are using
gcc 4.2.0. Try using comparable compiler versions, e.g. an Ada-aware
gcc 4.2.0 (several Linux distributions provide this) or
a gcc 3.4.6 version of Fortran (i.e. some version of g77).

Ciao,

Duncan.

Samuel Tardieu

Oct 20, 2006, 8:09:45 AM
>>>>> "tkrauss" == tkrauss <thomas...@gmail.com> writes:

tkrauss> Running them on 800x800 matrices (on my 2GHz laptop)

tkrauss> for Ada: "tst_array 800" runs in 18 seconds for Fortran
tkrauss> "tst_array 800" runs in 6 seconds

tkrauss> (if I use the fortran "matmul" intrinsic the fortran time
tkrauss> drops to 2.5 seconds)

tkrauss> Note, I tried reordering the loops, removing the random
tkrauss> calls, etc. none of which made huge changes. There is
tkrauss> something killing performance and/or a switch or two that I'm
tkrauss> missing, but I can't seem to find it. Any thoughts?

First of all, what you measure is not only the matrix multiplication
time but also the operation of filling the matrices with random
numbers. I've moved the "start" initialization to after the matrices
are initialized.

The following optimizations make the difference smaller (9.47 seconds
for Fortran vs. 11.90 seconds for Ada on my machine):

- use -fomit-frame-pointer on gnatmake command line (this doesn't
change anything in the Fortran case)

- add: pragma Convention (Fortran, Real_Matrix) to invert the
storage method (line vs. column); I guess this helps maintaining
more data in the cache

- use 1 .. N as loop indices instead of A'Range (1) and friends;
this is more equivalent to the Fortran code you posted

Still, this is a huge penalty for Ada. Unfortunately, I don't have the
time to investigate further right now. However, I would be interested
in other people's findings.

Sam
--
Samuel Tardieu -- s...@rfc1149.net -- http://www.rfc1149.net/

Gautier

Oct 20, 2006, 8:12:26 AM
tkrauss:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.

One thing is that you start the timer at the wrong place:
you should start it after filling your arrays with random numbers.
In the present state you compare the random generator first, then your
matrix multiplication. It is possible that Ada.Numerics.Float_Random.Random
takes significantly more time due to Ada's quality requirements on random
number generation.
As for switches, you can try:
  -O2 instead of -O3, -funroll-loops (usually good) and -ffast-math,
  for both Ada and Fortran;
  -gnatn, for Ada.
Since you are measuring real time and not CPU time, you might also want
to use somewhat larger matrices in order to keep disk swaps and leftover
initialization effects statistically small.
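
Putting those together, the command lines might look like this (a sketch;
each switch is worth checking against your GCC version's documentation):

   gnatmake -O2 -gnatp -gnatn -funroll-loops -ffast-math tst_array
   gfortran -O2 -funroll-loops -ffast-math tst_array.f95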

HTH, Gautier
______________________________________________________________
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm

NB: For a direct answer, e-mail address on the Web site!!

Samuel Tardieu

Oct 20, 2006, 8:18:14 AM
>>>>> "Sam" == Samuel Tardieu <s...@rfc1149.net> writes:

Sam> Still, this is a huge penalty for Ada. Unfortunately, I don't have
Sam> the time to investigate further right now. However, I would be
Sam> interested in other people's findings.

Oh, also this one lowers the Ada execution time by around 10%: do not
use an unconstrained type and do not use pointers.

N : constant Positive := Positive'Value (Argument (1));
G : Ada.Numerics.Float_Random.Generator;

type Real_Matrix is array(1 .. N, 1 .. N) of Float;
pragma Convention (Fortran, Real_Matrix);

A,B,C : Real_Matrix;

And as Duncan says, you should try with a newer backend for Ada.

Dmitry A. Kazakov

Oct 20, 2006, 8:35:01 AM
On 20 Oct 2006 03:47:44 -0700, tkrauss wrote:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.
>
> I'm using GNAT (GPL 2006) and GFortran (4.2.0) and the following
> compile options:
>
> gnat make -O3 -gnatp tst_array
> gfortran -O3 tst_array.f95
>
> Running them on 800x800 matrices (on my 2GHz laptop)
>
> for Ada: "tst_array 800" runs in 18 seconds
> for Fortran "tst_array 800" runs in 6 seconds
>
> (if I use the fortran "matmul" intrinsic the fortran time drops to
> 2.5 seconds)
>
> Note, I tried reordering the loops, removing the random calls, etc.
> none of which made huge changes. There is something killing
> performance
> and/or a switch or two that I'm missing, but I can't seem to find it.
> Any thoughts?
>
> Here is the code:

[...]

Pointers + index checks I'd say. Try this:

with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array_2 is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;

   Start, Finish : Ada.Calendar.Time;
   Sum           : Float := 0.0;
begin
   N := Positive'Value (Argument (1));

   declare
      subtype Matrix is Real_Matrix (1 .. N, 1 .. N);
      A, B, C : Matrix;
   begin
      for I in A'Range (1) loop
         for J in A'Range (2) loop
            A (I, J) := Ada.Numerics.Float_Random.Random (G);
            B (I, J) := Ada.Numerics.Float_Random.Random (G);
         end loop;
      end loop;

      Start := Ada.Calendar.Clock;

      for I in A'Range (1) loop
         for J in A'Range (2) loop
            Sum := 0.0;
            for R in A'Range (2) loop
               Sum := Sum + A (I, R) * B (R, J);
            end loop;
            C (I, J) := Sum;
         end loop;
      end loop;

      Finish := Ada.Calendar.Clock;

      F_IO.Put (C (1, 1)); F_IO.Put (C (1, 2)); New_Line;
      F_IO.Put (C (2, 1)); F_IO.Put (C (2, 2)); New_Line;
   end;

   Put ("Time: "); D_IO.Put (Finish - Start); New_Line; New_Line;
end tst_array_2;

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Gautier

Oct 20, 2006, 8:52:34 AM
tkrauss wrote:

> N := Positive'Value (Argument (1));

...


> for I in A'range(1) loop
> for J in A'range(2) loop

you probably have a penalty there against

> do I = 1,N
> do J = 1,N

I guess the GNAT compiler can't tell the bounds statically, due to

> A := new Real_Matrix(1..N, 1..N);

especially because the allocation is dynamic. Since you indeed have
1 .. N as the index range, you should try

   for I in 1 .. N loop

and even

   for I in reverse 1 .. N loop

Perhaps g95 does such an optimization, where Ada has to stick with the
specified iteration direction.
Also, compiling with -S and looking at the assembler code helps a lot...
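
For example (a sketch, assuming a GNAT-enabled gcc driver; the exact
invocation can differ per installation):

   gcc -S -O3 -gnatp tst_array.adb
   gfortran -S -O3 tst_array.f95

each of which leaves the generated assembly in a .s file.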

claude...@equipement.gouv.fr

Oct 20, 2006, 9:27:21 AM
With some changes you could speed up your code:

First, the matrix product is not optimal as written: you can transpose B.

Second, matrix A is used row by row, so you can copy A (I, 1..N) into a
vector, say AI, and replace A(I,R) by AI(R) in the product (it should be
up to the compiler to see that :-( ). See the sketch below.

By applying these changes I got an Ada version faster than the Fortran one
(at the same -O3 optimization level).

You can make the same changes to your Fortran code and then search for a
new trick to speed up your Ada code :-).
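
A sketch of those two changes applied to the original inner loops (BT and
AI are names introduced here for illustration; to be fair, the time spent
transposing should be counted as part of the multiplication):

   declare
      BT : Real_Matrix (1 .. N, 1 .. N);   -- transpose of B
      AI : array (1 .. N) of Float;        -- copy of row I of A
   begin
      for I in 1 .. N loop
         for J in 1 .. N loop
            BT (J, I) := B (I, J);
         end loop;
      end loop;

      for I in 1 .. N loop
         for R in 1 .. N loop
            AI (R) := A (I, R);
         end loop;
         for J in 1 .. N loop
            Sum := 0.0;
            for R in 1 .. N loop
               --  both AI and the J-th row of BT are walked contiguously
               --  in Ada's default row-major layout
               Sum := Sum + AI (R) * BT (J, R);
            end loop;
            C (I, J) := Sum;
         end loop;
      end loop;
   end;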

Claude SIMON


Robert A Duff

Oct 20, 2006, 11:38:12 AM
"tkrauss" <thomas...@gmail.com> writes:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. ...

In some cases, -O3 is slower than -O2. You could try that experiment.

Is the Fortran compiler generating array-bounds checks? If not,
pragma Suppress(All_Checks) in the Ada version will make the test more
fair.
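
For instance, a sketch of where such a pragma could go (it is also allowed
as a configuration pragma at the top of the file):

   procedure tst_array is
      pragma Suppress (All_Checks);
      --  ... rest of the declarations unchanged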

- Bob

Martin Krischik

Oct 20, 2006, 11:41:12 AM
Duncan Sands wrote:

Indeed. GNAT/GPL:

./Linux-x86_64-Release/Test_Array_1 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02
Time: 12.265258000

GNAT/GCC:

./Linux-x86_64-Release/Test_Array_1 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02
Time: 10.631787000

So who said the 4.1.x compilers are slower...

Martin
--
mailto://kris...@users.sourceforge.net
Ada programming at: http://ada.krischik.com

Jeffrey Creem

Oct 20, 2006, 11:56:50 AM
tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.
>

There have been a few useful comments (and quite a few not really useful
ones) but in the end, it seems pretty clear to me that in this
particular case GNAT sucks compared to the fortran version.

I built the gcc "head" from gcc SVN with GNAT and Fortran to compare the
same versions (at least as much as possible).

I moved the start timing calls to after the array allocation and filling,
so we are just timing the matrix multiplication.

I also moved the end timing calls to make sure we were not timing IO in
either case (both original versions were timing part of the "put").

I replaced the "random" data with some fixed sane data just to be sure
there was no funky "denormal" stuff happening that changed the speed.

Very little change in the order of magnitude that the original poster
was seeing (I pretty much get results with GNAT running about 2.6 times
slower), so it was time to look at the assembly.

I find it easier to read assembly using SSE math, so I built the Ada
version via

gnatmake -g -f -gnatp -O3 -march=pentium4 -fomit-frame-pointer
-mfpmath=sse tst_array

and the Fortran via

gfortran -O3 -g -march=pentium4 -fomit-frame-pointer -mfpmath=sse -c
tst_array.f95

and then used

objdump -D -S tst_array.o

to look at them; you can pretty quickly see the problem.

The "inner loop" of the fortran code looks like

2d0: 8d 04 19 lea (%ecx,%ebx,1),%eax
2d3: f3 0f 10 02 movss (%edx),%xmm0
2d7: f3 0f 59 44 85 04 mulss 0x4(%ebp,%eax,4),%xmm0
2dd: f3 0f 58 c8 addss %xmm0,%xmm1
2e1: 83 c1 01 add $0x1,%ecx
2e4: 01 f2 add %esi,%edx
2e6: 39 f9 cmp %edi,%ecx
2e8: 75 e6 jne 2d0 <MAIN__+0x2d0>


The "inner loop of the Ada code looks like


af2: 83 c6 01 add $0x1,%esi
af5: 89 f0 mov %esi,%eax
af7: 2b 44 24 28 sub 0x28(%esp),%eax
afb: 03 44 24 30 add 0x30(%esp),%eax
aff: 8b 5c 24 38 mov 0x38(%esp),%ebx
b03: f3 0f 10 0c 83 movss (%ebx,%eax,4),%xmm1
b08: 8b 4d 00 mov 0x0(%ebp),%ecx
b0b: 8b 45 0c mov 0xc(%ebp),%eax
b0e: 8b 55 08 mov 0x8(%ebp),%edx
b11: 8b 5c 24 78 mov 0x78(%esp),%ebx
b15: 29 d3 sub %edx,%ebx
b17: 89 f7 mov %esi,%edi
b19: 29 cf sub %ecx,%edi
b1b: 89 f9 mov %edi,%ecx
b1d: 83 c0 01 add $0x1,%eax
b20: 29 d0 sub %edx,%eax
b22: 01 c0 add %eax,%eax
b24: 01 c0 add %eax,%eax
b26: ba 00 00 00 00 mov $0x0,%edx
b2b: 0f 48 c2 cmovs %edx,%eax
b2e: 0f af c8 imul %eax,%ecx
b31: 8d 1c 99 lea (%ecx,%ebx,4),%ebx
b34: 8b 44 24 3c mov 0x3c(%esp),%eax
b38: f3 0f 10 04 03 movss (%ebx,%eax,1),%xmm0
b3d: f3 0f 59 c1 mulss %xmm1,%xmm0
b41: f3 0f 58 d0 addss %xmm0,%xmm2
b45: 3b 74 24 7c cmp 0x7c(%esp),%esi
b49: 75 a7 jne af2 <_ada_tst_array+0x254>

28 instructions vs. 8 for Fortran.

The GNAT version never stood a chance. It really seems like GNAT is
dropping the ball here.

Granted small benchmarks can really lead one to believe things are
better or worse than the truth but I don't think there is really an
excuse in this case for this sort of performance.


Martin Krischik

Oct 20, 2006, 11:53:58 AM
Dmitry A. Kazakov wrote:

> declare
> subtype Matrix is Real_Matrix (1..N, 1..N);
> A, B, C : Matrix;
> begin

Indeed faster:

./Linux-x86_64-Release/Test_Array_1 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02

Time: 10.459045000

./Linux-x86_64-Release/Test_Array_2 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02
Time: 6.313824000

Martin Krischik

Oct 20, 2006, 12:30:29 PM
tkrauss wrote:

> Unfortunately the Ada code takes 3-4 times as long.

Well, I collected the various suggestions made and created three test
programs [1,2,3], where each is indeed faster than the last.


GNAT 4.1.1

./Linux-x86_64-Release/Test_Array_1 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02

Time: 18.615262000

./Linux-x86_64-Release/Test_Array_2 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02

Time: 5.457141000

./Linux-x86_64-Release/Test_Array_3 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02
Time: 4.509063000

GNAT GPL 2006 (20060522-34)

./Linux-x86_64-Release/Test_Array_1 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02

Time: 13.950607000

./Linux-x86_64-Release/Test_Array_2 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02

Time: 8.010260000

./Linux-x86_64-Release/Test_Array_3 800
1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02
Time: 7.101502000

As options I used my trusted "Release" settings, which you can see here [4].

Martin

[1]http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_1.adb?view=markup
[2]http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_2.adb?view=markup
[3]http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_3.adb?view=markup
[4]http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/GNAT/wikibook_ada.gpr?view=markup

Gautier

Oct 20, 2006, 3:32:26 PM
Robert A Duff:

> Is the Fortran compiler generating array-bounds checks? If not,
> pragma Suppress(All_Checks) in the Ada version will make the test more
> fair.

Thomas mentioned the -gnatp option, which has the same effect as "pragma
Suppress(All_Checks)" in the source.

G.

Gautier

Oct 20, 2006, 3:51:13 PM
Just a detail (it should not make much difference, but who knows with
Ada.Text_IO): the timer should be stopped just after the multiplication
and before any output. This little bug is in both the Fortran and the Ada
code (in the Ada code the finish time is even taken twice).
G.

Jeffrey R. Carter

Oct 20, 2006, 6:11:17 PM
tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.

First, make sure you're timing the same thing. How does Ada matrix
multiplication compare to FORTRAN? You don't know, because you're also
timing the random number generators and I/O. Ada.Text_IO, especially, is
quite heavyweight compared to other languages.

Here's my version of your program:

with Ada.Numerics.Float_Random;
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Calendar;     use Ada.Calendar;

procedure Tst_Array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   N : constant Positive := Integer'Value (Argument (1) );

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   pragma Convention (FORTRAN, Real_Matrix);

   G : Ada.Numerics.Float_Random.Generator;

   A, B : Real_Matrix :=
     (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
   C : Real_Matrix := (others => (others => 0.0) );
   Start, Finish : Ada.Calendar.Time;
begin
   Start := Ada.Calendar.Clock;

   for I in A'Range (1) loop
      for J in A'Range (2) loop
         for R in A'Range (2) loop
            C (I, J) := C (I, J) + A (I, R) * B (R, J);
         end loop;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1) );
   F_IO.Put (C (1, 2) );
   New_Line;
   F_IO.Put (C (2, 1) );
   F_IO.Put (C (2, 2) );
   New_Line;

   Put ("Time: ");
   D_IO.Put (Finish - Start);
   New_Line;
end Tst_Array;

I compiled as

gnatmake -O3 -gnatnp -fomit-frame-pointer tst_array

With MinGW GNAT 3.4.2, Windows XP, and a 3.2 GHz Pentium 4 HT processor,
I get

C:\Documents and Settings\All Users\Documents\Code>tst_array 800
2.05839E+02 2.01230E+02
2.00866E+02 1.94039E+02
Time: 5.821082831

I don't have a FORTRAN compiler for comparison. FORTRAN will probably do
better, but I wouldn't expect the difference to be as great as your values.

--
Jeff Carter
"I was hobbling along, minding my own business, all of a
sudden, up he comes, cures me! One minute I'm a leper with
a trade, next minute my livelihood's gone! Not so much as a
'by your leave!' You're cured, mate. Bloody do-gooder!"
Monty Python's Life of Brian
76

Jeffrey Creem

Oct 20, 2006, 7:52:54 PM
Jeffrey R. Carter wrote:
> tkrauss wrote:
>
>> I'm a bit stuck trying to figure out how to coax more performance
>> out of some Ada code. I suspect there is something simple (like
>> compiler switches) but I'm missing it. As an example I'm using
>> a simple matrix multiply and comparing it to similar code in
>> Fortran. Unfortunately the Ada code takes 3-4 times as long.
>
>
> 1st, make sure you're timing the same thing. How does Ada matrix
> multiplication compare to FORTRAN? You don't know, because you're also
> timing the random number generators and I/O. Ada.Text_IO, especially, is
> quite heavy-weight compared to other languages.
>
> Here's my version of your program:
>

Note, I am the first one to jump to the defense of "Ada" in general but
in this case, GNAT just plain sucks compared to GNU FORTRAN as it does a
poor job on (at least) the inner loop (verified by looking at the output
assembly)

Jeff's (the other Jeff :) modified version looks a little cleaner and
actually runs faster (than even my old "fixed version" that did not time
the IO and made sure to just time the matrix multiply in both versions),
but it does not time the zeroing of the elements of C, which would be
required if this were a real matrix multiply routine and not some test
driver.

However, even having said that, this not-really-equivalent version runs
about 2x slower than the FORTRAN (with the same version of GCC).

I don't see any meaningful excuse for why GNAT should be slower in this
case, but it clearly is.

I tried looking at the tree dump generated by the front ends prior to
the optimizer step (not something I have a lot of experience at). One
thing that is clear is that the trees generated by GNAT are quite a bit
uglier and more verbose, so it is not surprising that the optimizer is
unable to fully clean things up, resulting in the explosion of
instructions at the assembly level.


Gautier

Oct 21, 2006, 3:37:41 AM
Jeffrey Creem:

> Note, I am the first one to jump to the defense of "Ada" in general but
> in this case, GNAT just plain sucks compared to GNU FORTRAN as it does a
> poor job on (at least) the inner loop (verified by looking at the output
> assembly)

There is something strange... Martin Krischik was able to trim the overall
time for the Ada code down to 24% of the first version (GNAT/GCC 4.1.1).
This should make the Ada program as fast as the FORTRAN one, shouldn't it?
Maybe it's because the test is done on a 64-bit machine?
It needs some reconciliation...
A good thing in this discussion would be for everybody to show, each time:
- which GCC version
- which machine
- the execution time of the multiplication for both Ada and Fortran
- which version of the Ada code (matrix on stack/heap, Fortran or Ada convention)

Cheers, Gautier

Stephen Leake

Oct 21, 2006, 6:45:37 AM
Duncan Sands <duncan...@math.u-psud.fr> writes:


I have a pre-release of GNAT 5.05, which is based on gcc 4.2. I tried
tst_array with both -O2 and -O3; -O2 was faster. AdaCore says -O3 has
"dangerous experimental optimizations; don't use it". Here's a
comparison:

GNAT 5.04, -O2 -gnatp:

./tst_array.exe 800
Init Time: 0.557314686

Mult Time: 20.139092538

1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02


GNAT 5.05, -O2 -gnatp:

./tst_array.exe 800
Init Time: 0.575690486

Mult Time: 12.160329316

1.99202E+02 1.96854E+02
1.88844E+02 1.84498E+02


Note that I timed the initialization with random numbers separately from
the matrix multiply; it could easily be that the random number generators
are different. But that's a small effect.
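
A sketch of the split timing (Mid is a variable name added here for
illustration, not from the original program):

   Start := Ada.Calendar.Clock;
   --  ... fill A and B with random values ...
   Mid := Ada.Calendar.Clock;
   --  ... triple multiplication loop ...
   Finish := Ada.Calendar.Clock;
   Put ("Init Time: "); D_IO.Put (Mid - Start); New_Line;
   Put ("Mult Time: "); D_IO.Put (Finish - Mid); New_Line;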

I don't have gfortran installed.

--
-- Stephe

Dr. Adrian Wrigley

Oct 21, 2006, 8:39:30 AM
On Fri, 20 Oct 2006 03:47:44 -0700, tkrauss wrote:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.
>
> I'm using GNAT (GPL 2006) and GFortran (4.2.0) and the following
> compile options:
>
> gnat make -O3 -gnatp tst_array
> gfortran -O3 tst_array.f95
>
> Running them on 800x800 matrices (on my 2GHz laptop)
>
> for Ada: "tst_array 800" runs in 18 seconds
> for Fortran "tst_array 800" runs in 6 seconds
>
> (if I use the fortran "matmul" intrinsic the fortran time drops to
> 2.5 seconds)
>
> Note, I tried reordering the loops, removing the random calls, etc.
> none of which made huge changes. There is something killing
> performance
> and/or a switch or two that I'm missing, but I can't seem to find it.
> Any
> thoughts?

When I started using Ada, I found exactly the same thing. Programs
ran slower. Sometimes much slower. Perhaps this is the biggest
single disadvantage of Ada (GNAT) in practice, particularly for
heavy numerical codes (compared to Fortran or C).

Looking closer, I found the assembly output was sometimes
littered with apparently redundant or unintended function
calls, extra checks and other baggage.

The prevailing claims at the time were that Ada was roughly
as fast as C, sometimes faster (because of deeper semantics).
Whenever logically identical code was compared, however,
the output assembly code was often identical, giving the
same performance.

In real code, I found big differences when using enumerations
instead of integers, multi-dimensional arrays, etc. And
language-defined maths libraries, random number generators,
I/O etc. all risked major slow-downs.

The only solution I found was to examine the output code,
and modify the source until I got code that was acceptable.
Sometimes this meant dropping useful language features, or
using less clear constructs.

This is a real problem with the language (Ada using GNAT).
For a lot of applications, the optimisation effort isn't
worth it. And even for performance critical applications,
most code is outside of any hot-spots.

I get the impression that Fortran is an excellent language
for getting efficient code easily. C is also quite good.
But C++ and Ada both seem to need "hand-holding" to keep them
efficient. Perl is the worst I know(!)

If you really care about performance, you'll check the
assembly code or compare against expectations. As you
fix unexpected bottlenecks, you'll find out what type of
code compiles well with your compiler, and write future
code avoiding the problem areas. Of course, when the
compiler changes, you may find the rules change and your
old code appears quaint or idiosyncratic.

The other posters to this thread have given some useful
optimisations to your original code. Let us know
whether this bridges the performance gap for you!
--
Adrian Wrigley, Cambridge, UK.

Jeffrey Creem

Oct 21, 2006, 12:35:54 PM

I'd certainly be willing to run a few benchmarks but the important thing
here is that rather innocent looking code is running 2-4x slower than it
"should".

There are things that I think we can really rule out as being "the" factor.

1) Random number generator - I did timings (for both the Ada and
FORTRAN) with the timing moved to cover only the matrix multiply.
2) Different GCC versions - I built a fresh GCC from the GCC trunk for
both Ada and FORTRAN.
3) The machine - I am running both on the same machine, though I suppose
there could be differences in 32-bit vs. 64-bit comparisons.
4) Runtime checks - both the original author (and I) ran with checks
suppressed.
5) O2/O3 - Actually, I could look at this some more with some other
versions, but a quick look when I first started seemed to indicate this
was not the issue.

A few other thoughts.


Once the timing is limited to just the matrix multiply the stack/heap
thing 'should' generally not matter.

Some of the changes made to the Ada version make it not really the same
program as the FORTRAN version, and the same changes made to the FORTRAN
one would also cause it to speed up (e.g. not counting the zeroing
of the target array during the accumulation phase).

I have certainly seen some amazing performance from some Ada compilers
in the past, and in general, on non-trivial benchmarks, I am usually
pretty happy with the output of GNAT as well, but in this case it is
not great.

Further, I tried playing a bit with the new autovectorization capability
of the 4.x series of GCC (it has to be specifically enabled) and found
that even very, very trivial cases would refuse to vectorize under Ada
(though after I submitted the bug report to GCC, I found that FORTRAN
fails to vectorize these too).
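
For anyone wanting to reproduce this, the switch in those GCC versions is
-ftree-vectorize, e.g. (a sketch; spellings are worth checking per version):

   gnatmake -O2 -gnatp -ftree-vectorize -march=pentium4 -mfpmath=sse tst_array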

One thing everyone needs to remember is that this example was (probably)
not "find the way to get the smallest value out of this test program",
because there are always ways of doing some tweaks to a small enough
region of code to make it better. If there is a 2-4x global slowdown in
your 100KLOC program, you will never "get there" following the
conventional wisdom of profiling and looking for the problems.

Now, I am not suggesting that GNAT is globally 2-4x slower than GFORTRAN
or anything like that (since that does not line up with what I have
generally seen on larger code bases), but if I were a manager picking a
new language based on a set of long-term goals for a project and saw
that GNAT was running 2-4x slower, and was still running 1.X to 3X
slower after 2 days of Ada gurus looking at it, I'd probably jettison
Ada (I know, I am mixing compilers and languages here, but in reality,
that is what happens in the real world) and go with something else.

And before the chorus of "processors are so fast that performance does
not matter as much as safety and correctness" starts getting too
loud, let me point out that there are still many segments of the
industry where performance does still indeed matter, especially when
one is trading adding a second processor to an embedded box against a
vague promise of "betterness" in terms of safety down the road... OK,
off the soapbox.

So, in closing, if someone thinks they have "the best" version of that
program they want timed against gfortran, post it here and I'll run them.

Pascal Obry

Oct 21, 2006, 1:04:47 PM
to Jeffrey Creem
Jeffrey Creem wrote:

> I'd certainly be willing to run a few benchmarks but the important thing
> here is that rather innocent looking code is running 2-4x slower than it
> "should".

The first thing is to be sure that we are running the same program.

Running your program with the following changes (as done in Fortran):

1. Using Sum tmp var for the computation

   for I in A'Range (1) loop
      for J in A'Range (2) loop
         Sum := 0.0;
         for R in A'Range (2) loop
            Sum := Sum + A (I, R) * B (R, J);
         end loop;
         C (I, J) := Sum;
      end loop;
   end loop;

2. Using Long_Float instead of Float (I think Fortran float is a
Long_Float, to be checked).

I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2).

Pascal.

--

--|------------------------------------------------------
--| Pascal Obry Team-Ada Member
--| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE
--|------------------------------------------------------
--| http://www.obry.net
--| "The best way to travel is by means of imagination"
--|
--| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595

tmo...@acm.org

Oct 21, 2006, 2:28:47 PM
> I'd certainly be willing to run a few benchmarks but the important thing
> here is that rather innocent looking code is running 2-4x slower than it
> "should".
But:

> > for Ada: "tst_array 800" runs in 18 seconds
> > for Fortran "tst_array 800" runs in 6 seconds
> >
> > (if I use the fortran "matmul" intrinsic the fortran time drops to
> > 2.5 seconds)
Clearly the right way to do large matrix multiplies is to call matmul.
I'd be surprised if it made any significant difference whether it was
called from Ada or from Fortran.
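
For reference, Ada 2005 defines a language-level analogue of matmul; a
minimal sketch, assuming a compiler that already provides
Ada.Numerics.Generic_Real_Arrays (the thread predates full Ada 2005
support), with Matmul_Demo a name made up for illustration:

   with Ada.Numerics.Generic_Real_Arrays;

   procedure Matmul_Demo is
      package Real_Arrays is new Ada.Numerics.Generic_Real_Arrays (Float);
      use Real_Arrays;

      N : constant := 800;
      A : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 1.0));
      B : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 2.0));
      C : Real_Matrix (1 .. N, 1 .. N);
   begin
      C := A * B;  --  library-level matrix product, like Fortran's matmul
   end Matmul_Demo;

Whether it is competitive with an optimized matmul depends entirely on the
implementation.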

Jeffrey Creem

Oct 21, 2006, 5:22:05 PM
Pascal Obry wrote:
> Jeffrey Creem wrote:
>
>
>>I'd certainly be willing to run a few benchmarks but the important thing
>>here is that rather innocent looking code is running 2-4x slower than it
>> "should".
>
>
> The first thing is to be sure that we are running the same program.
>
> Running your program with the following changes (as done in Fortran):
>
> 1. Using Sum tmp var for the computation
>
>    for I in A'Range (1) loop
>       for J in A'Range (2) loop
>          Sum := 0.0;
>          for R in A'Range (2) loop
>             Sum := Sum + A (I, R) * B (R, J);
>          end loop;
>          C (I, J) := Sum;
>       end loop;
>    end loop;
>
> 2. Using Long_Float instead of Float (I think Fortran float is a
> Long_Float, to be checked).
>
> I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2).
>
> Pascal.
>

Actually, the original poster's Ada program had the temp var, and all of
my comparisons of programs that I have asserted were "the same" used the
temporary.

As for Long_Float vs. Short_Float, gfortran is using 32-bit floats (as
verified by dumping the tree representation and assembly language).

Since we are all getting a bit confused by specific versions and numbers
I thought I'd post a summary and create a way of tracking more global
performance issues at the gnuada.sf.net wiki.

The direct link to these test results are here:

http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison

Jeffrey Creem

Oct 21, 2006, 11:03:14 PM

Actually, as a result of this, I submitted a bug report to the GCC
bugzilla list. You can follow progress on it here:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543

Interesting initial feedback is that:
1) It is not an Ada bug.
2) It is a FORTRAN bug.
3) It is a backend limitation of the optimizer.

Of course, the FORTRAN one still runs correctly so I don't think most
users will care that it is because of a bug :)

Jeffrey R. Carter

Oct 22, 2006, 3:39:27 AM
Jeffrey Creem wrote:
>
> Actually, as a result of this, I submitted a bug report to the GCC
> bugzilla list. You can follow progress on it here:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543
>
> Interesting initial feedback is that
> 1) Not an Ada bug.
> 2) Is a FORTRAN bug
> 3) Is a backend limitation of the optimizer.
>
> Of course, the FORTRAN one still runs correctly so I don't think most
> users will care that it is because of a bug :)

Interesting. I've been experimenting with some variations simply out of
curiosity and found some things that seem a bit strange. (All results
for an argument of 800.)

Adding the Sum variable makes an important difference, as others have
reported, in my case from 5.82 to 4.38 s. Hoisting the indexing
calculation for the result (C) matrix location is a basic optimization,
and I would be surprised if it isn't done. The only thing I can think of
is that it's a cache issue: that all 3 matrices can't be kept in cache
at once. Perhaps compiler writers would be able to make sense of this.

Previously, I found no difference between -O2 and -O3. With this change,
-O2 is faster.

The issue of using 'range compared to using "1 .. N" makes no difference
in my version of the program.

Something I found really surprising is that putting the multiplication
in a procedure makes the program faster, down to 4.03 s. I have no idea
why this would be so.

Compiled with MinGW GNAT 3.4.2, -O2, -gnatnp -fomit-frame-pointer. Run
under Windows XP SP2 on a 3.2 GHz Pentium 4 HT with 1 GB RAM.

Here's the code:

with Ada.Numerics.Float_Random;
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Calendar;     use Ada.Calendar;

procedure Tst_Array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   N : constant Positive := Integer'Value (Argument (1) );

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   pragma Convention (FORTRAN, Real_Matrix);

   G : Ada.Numerics.Float_Random.Generator;

   A, B : Real_Matrix :=
     (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
   C : Real_Matrix := (others => (others => 0.0) );
   Start, Finish : Ada.Calendar.Time;

   procedure Multiply is
      Sum : Float;
   begin -- Multiply
      All_Rows : for Row in A'Range (1) loop
         All_Columns : for Column in B'Range (2) loop
            Sum := 0.0;

            All_Common : for R in A'Range (2) loop
               Sum := Sum + A (Row, R) * B (R, Column);
            end loop All_Common;

            C (Row, Column) := Sum;
         end loop All_Columns;
      end loop All_Rows;
   end Multiply;
begin
   Start := Ada.Calendar.Clock;
   Multiply;
   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1) );
   F_IO.Put (C (1, 2) );
   New_Line;
   F_IO.Put (C (2, 1) );
   F_IO.Put (C (2, 2) );
   New_Line;

   Put ("Time: ");
   D_IO.Put (Finish - Start);
   New_Line;
end Tst_Array;

Next, since meaningful speed-ups of quick sort on a Pentium 4 HT
processor using 2 tasks have been reported, I thought I'd see what
effect that had here. With 2 tasks, I got a time of 3.70 s. That's not a
significant speed-up, about 9.1%.

Same compilation options and platform.

Here's that code:

with Ada.Numerics.Float_Random;
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Calendar;     use Ada.Calendar;

procedure Tst_Array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   N : constant Positive := Integer'Value (Argument (1) );

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   pragma Convention (FORTRAN, Real_Matrix);

   G : Ada.Numerics.Float_Random.Generator;

   A, B : Real_Matrix :=
     (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
   C : Real_Matrix := (others => (others => 0.0) );
   Start, Finish : Ada.Calendar.Time;

   procedure Multiply is
      procedure Multiply
        (Start_Row : in Positive; Stop_Row : in Positive)
      is
         Sum : Float;
      begin -- Multiply
         All_Rows : for Row in Start_Row .. Stop_Row loop
            All_Columns : for Column in B'Range (2) loop
               Sum := 0.0;

               All_Common : for R in A'Range (2) loop
                  Sum := Sum + A (Row, R) * B (R, Column);
               end loop All_Common;

               C (Row, Column) := Sum;
            end loop All_Columns;
         end loop All_Rows;
      end Multiply;

      task type Multiplier (Start_Row : Positive; Stop_Row : Positive);

      task body Multiplier is
      begin -- Multiplier
         Multiply (Start_Row => Start_Row, Stop_Row => Stop_Row);
      end Multiplier;

      Stop  : constant Positive := N / 2;
      Start : constant Positive := Stop + 1;

      Mult : Multiplier (Start_Row => 1, Stop_Row => Stop);
   begin -- Multiply
      Multiply (Start_Row => Start, Stop_Row => N);
   end Multiply;
begin
   Start := Ada.Calendar.Clock;
   Multiply;
   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1) );
   F_IO.Put (C (1, 2) );
   New_Line;
   F_IO.Put (C (2, 1) );
   F_IO.Put (C (2, 2) );
   New_Line;

   Put ("Time: ");
   D_IO.Put (Finish - Start);
   New_Line;
end Tst_Array;

If I inline the inner Multiply, or put equivalent code in the task and
the outer Multiply, the time is much more than for the sequential
version, presumably due to cache effects.

Since it appears you have 2 physical processors ("Dual Xeon 2.8 GHz"), I
would be interested in seeing what effect this concurrent version has on
that platform. I also wonder how easy such a version would be to create
in FORTRAN.

--
Jeff Carter
"Ada has made you lazy and careless. You can write programs in C that
are just as safe by the simple application of super-human diligence."
E. Robert Tisdale
72

tkrauss

Oct 22, 2006, 7:48:36 AM
> If I inline the inner Multiply, or put equivalent code in the task and
> the outer Mutliply, the time is much more than for the sequential
> version, presumably due to cache effects.
>
> Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I
> would be interested in seeing what effect this concurrent version has on
> that platform. I also wonder how easy such a version would be to create
> in FORTRAN.


It's funny that you mention tasking since that's what got me started on
this in the first place. I was introduced to OpenMP as a simple method
of parallelizing code in fortran. Unfortunately it seems to be
"tricky" to really get things in parallel (what data is shared,
multiple tasks doing the same thing with the same memory, access
violations, etc.) I remembered that Ada had tasking built in so I
started playing with that. Now, as you can probably tell from my code,
I haven't touched Ada in a _very_ long time, but it was surprisingly
easy set up a simple two-task test program.

Anyway, using the latest code from Jeffrey Creem it looks like the
execution time (on my machine) has been cut in half (9 seconds). The
threaded version runs in nearly the same time for smaller problems but
dies with a stack overflow for larger. I see a comment in the Bugzilla
recommending a similar construct:

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;

That takes the memory from the stack rather than the heap though, no?
I assume there is a compiler switch to increase the stack size so the
code wouldn't die, but is that the "normal" way of allocating
memory? I'm trying to not look like _too_ much of an Ada neophyte :)

Gautier

Oct 22, 2006, 8:31:05 AM
Jeffrey R. Carter:

> Adding the Sum variable makes an important difference, as others have
> reported, in my case from 5.82 to 4.38 s. Hoisting the indexing
> calculation for the result (C) matrix location is a basic optimization,
> and I would be surprised if it isn't done. The only thing I can think of
> is that it's a cache issue: that all 3 matrices can't be kept in cache
> at once. Perhaps compiler writers would be able to make sense of this.

The Sum variable was *removed* by someone at some point of the discussion
in order to challenge the Ada compiler's optimizer a bit more.
If you replace a good algorithm by a bad one, don't be surprised that the
program is slow. At some point of bad coding, the best code optimizer
won't be able to help you.
Eventually the optimizer will transform this:

   for R in A'Range (2) loop
      C (I, J) := C (I, J) + A (I, R) * B (R, J);
   end loop;

into something like:

   Sum := C (I, J);  -- hopefully Sum is mapped to a register
   for R in A'Range (2) loop
      Sum := Sum + A (I, R) * B (R, J);
   end loop;
   C (I, J) := Sum;

but in that case it probably won't be able to guess that C(I,J) was zeroed
before, and replace the first line by:

   Sum := 0.0;

sparing the reading of C(I,J) (which must cost much time...).
If you are luckier, the optimizer will do:

   Sum := 0.0;
   for R in A'Range (2) loop
      Sum := Sum + A (I, R) * B (R, J);
   end loop;
   C (I, J) := C (I, J) + Sum;

But still, it won't spare the time lost to fill the C matrix with zeros.
If you want to do a benchmark against Fortran, it's really not a good idea
to begin with "pessimizing" the Ada code.

Cheers

Gautier

Alinabi

Oct 22, 2006, 9:50:13 AM
I ran your test programs compiled with gcc 4.0.3 and the following
optimizations:
COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse
-fomit-frame-pointer -O2 -fdump-tree-optimized
and I cannot reproduce the large differences in performance everyone
else talks about. Here are the times I get:

N Ada Fortran
====================
64 0.002029 0.000000
128 0.016321 0.016000
256 0.214143 0.204013
512 3.125888 3.124195
800 6.374982 5.864366
1024 34.10479 35.22620
2048 277.3071 283.2417

Jeffrey Creem

Oct 22, 2006, 11:41:20 AM
Alinabi wrote:
> I ran your test programs compiled with gcc 4.0.3 and the following
> optimizations:
> COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse
> -fomit-frame-pointer -O2 -fdump-tree-optimized
> and I cannot reproduce the large differences in performance everyone
> else talks about. Here are the times I get:
>
> N Ada Fortran
> ====================
> 64 0.002029 0.000000
> 128 0.016321 0.016000
> 256 0.214143 0.204013
> 512 3.125888 3.124195
> 800 6.374982 5.864366
> 1024 34.10479 35.22620
> 2048 277.3071 283.2417
>

That is interesting. The question then becomes: has FORTRAN improved on
the way to 4.2.0, or has Ada regressed?

Try doing a "make dis_all", which should produce annotated assembly
output. The Ada version can be a little daunting in the way we have set
up the files, since the generic instantiations at the top fill the .S
files (whoops, looks like I named them .dis) with the generic instance.

Even if you are not an assembly guru, if you start from the bottom of
the files you can usually pretty quickly find that inner loop and
compare the number of statements.

Jeffrey Creem

Oct 22, 2006, 11:57:16 AM
Followup on the bug report.

One of the comments asserted that the two programs were not equivalent,
though I am not yet 100% convinced of that.

His recommendation was to remove a level of indirection by changing the
way the array is declared:

   N : Positive := Positive'Value (Argument (1));
   G : Ada.Numerics.Float_Random.Generator;

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   type Matrix_Access is access Real_Matrix;

   A, B, C : Matrix_Access;

   Start, Finish : Ada.Calendar.Time;
   Sum           : Float := 0.0;
begin
   A := new Real_Matrix;
   B := new Real_Matrix;
   C := new Real_Matrix;

This does indeed substantially improve the performance (still not quite
to FORTRAN levels). The reason I question the equivalence is that in the
original version the array could have been passed to a procedure that
took in an unconstrained array (which is pretty much what I think the
FORTRAN version would allow), while this new version would not do that.
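
That is, with the original unconstrained declaration one could write, say
(an illustrative sketch, not code from the thread):

   procedure Multiply (A, B : in Real_Matrix; C : in out Real_Matrix);

and pass matrices of any size; with Real_Matrix constrained by N, that
flexibility is gone.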


A quick check shows the FORTRAN is now only 1.2 times faster (I did not
do the multiple-run thing yet). Perhaps tonight.

I'll also attempt to take one of the threaded ones and include that as
well.

Can someone who understands FORTRAN better make an argument about the
"closeness" of this approach vs. the other?

Georg Bauhaus

Oct 22, 2006, 2:02:24 PM
tkrauss wrote:

> That takes the memory from the stack rather than the heap though, no?
> I assume there is a compiler switch to increase the stack size so the
> code wouldn't die, but is that the "normal" way of allocating
> memory? I'm trying to not look like _too_ much of an Ada neophyte :)
>

I just tested a similar thing using two different compilers:
a function that uses a local array of a size determined by
a function parameter. The result was that

- GNAT may trigger constraint errors when the array size
exceeds some limit. The documentation tells the compiler
user that this is an operating system issue and should
be addressed at that level, IIUC. (Does this comment
apply in this case?)

- ObjectAda had no problem with local arrays of the same
tens of millions of components.

I've measured performance by calling the function many times with
different Size parameters for the array. This has consistently shown
that on this platform, GNAT code takes at least about twice as
much time as OA code. (The time is not affected by -O?.)

Using an allocator "new Big_Array(1 .. Size)" and an instance
of Unchecked_Deallocation in place of simple variable allocation
has two effects.

- The execution time of ObjectAda code almost does not change.
This lets me suspect that OA isn't necessarily using the
primary stack for local variables. Indeed, the assembly
listing shows calls on rts_ss_allocate (for the local array variable)
versus rts_allocate (for an array allocator). The RTS function
rts_ss_allocate is the link_name of the function
System.RTS.TGT.Sec_Stack_Pkg.Allocate from the ObjectAda RTS.

- The "GNAT heap version" performs about 2-3 times faster than
the "GNAT stack version".

(GNAT code that uses "new" instead of a simple local variable
performs minimally to just slightly faster than the OA code using
"new".)


So there is a major difference in execution speed of programs that
compute larger matrices if you use GNAT and simple array variables,
but not allocators.
Execution speed does not differ significantly using at least one
other compiler (ObjectAda).

I think that's a pity when seen from a "just Ada" perspective. There
is one more thing (allocators) you have to think about if you
want speed for local array variables when using GNAT.
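
A minimal sketch of the kind of comparison described (Size, Big_Array and
Use_It are illustrative names, not the actual test code):

   with Ada.Unchecked_Deallocation;

   procedure Use_It (Size : Positive) is
      type Big_Array is array (Positive range <>) of Float;
      type Big_Array_Ptr is access Big_Array;
      procedure Free is
        new Ada.Unchecked_Deallocation (Big_Array, Big_Array_Ptr);

      --  "stack version":  Data : Big_Array (1 .. Size);
      --  "heap version":
      Data : Big_Array_Ptr := new Big_Array (1 .. Size);
   begin
      Data (1)        := 1.0;  -- first,
      Data (Size / 2) := 2.0;  -- middle,
      Data (Size)     := 3.0;  -- and last element referenced
      Free (Data);
   end Use_It;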


Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle
and last element of the array are referenced. Pragma volatile
has been tried, too.

tmo...@acm.org

Oct 22, 2006, 2:01:53 PM
> ... the time is much more than for the sequential
> version, presumably due to cache effects.
Each 800x800 Float matrix is about 2.5 megabytes so I would expect two
threads to be fighting each other over the cache. What are the single vs
dual threaded times for, say, 250x250 arrays and a 1 MB cache on your
machine, and on the dual Xeon machine?
Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling
with -O2 -gnatp using the venerable gnat 3.15p, I get
Time: 8.511855190 800 1 thread
Time: 5.682686332 800 2 threads
for a speedup of 33%
Time: 0.092724434 200 1 thread
Time: 0.059209520 200 2 threads
for a speedup of 36%

Jeffrey Creem

Oct 22, 2006, 2:24:25 PM
Georg Bauhaus wrote:

>
> I think that's a pity when seen from a "just Ada" perspective. There
> is one more thing (allocators) you have to think about if you
> want speed for local array variables when using GNAT.
>
>
> Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle and last
> element of the array are referenced. Pragma volatile
> has been tried, too.

It sounds like you are running some different code, and I'd be hesitant
to make any assertions about runtime for certain constructs without
seeing it, since:

1) We just spent 3 days talking about the code that started this thread
and there have been all sorts of assertions about things being
faster/slower, etc.

2) You mention something about just accessing the first, middle and last
elements of your arrays, so it really sounds like you are just timing
allocations and not actually hitting the array indexing (though it is
hard to say without seeing the code).

Don't take either of those as any sort of severe criticism, but I am
just trying to make sure that any discussions of performance are tied to
clear, open and repeatable measurements so we all know what each other
is talking about.

The problem is that too often, performance criticism is done via vague
references rather than hard facts, and thus it gets dismissed
(sometimes) by some compiler writers.

The bug submitted on this issue (whether or not it ever gets fixed)
certainly has generated real and useful discussions among the
developers, with few of the typical 'silly user' responses that we
sometimes see.

Damien Carbonne

Oct 22, 2006, 3:32:29 PM
Jeffrey Creem wrote:

> Followup on the bug report.
>
> One of the comments asserted that the two programs were not equivalent,
> though I am not yet 100% convinced of that.
>
<snip>

>
>
> Can someone that understands FORTRAN better make an argument about the
> "closeness" of this approach v.s. the other?

I've been following this thread for some time, and I'm not a Fortran
guru; however, IIRC, Fortran arrays and Ada arrays don't have the same
memory layout (something like row-major order versus column-major order).
I've not compared the programs in this thread with care, but I have the
feeling that this could change things.
Some years ago, I had to mix Ada and Fortran, and I remember that we had
to invert the loop order to obtain the same result.
One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the
Ada side to obtain the same result, or to exchange the loops.
My knowledge of Fortran was limited to Fortran 77, and I don't know if
this is still valid with Fortran 95 or later.
But if it is still true, I would not consider the 2 programs as equivalent.

Regards

Damien Carbonne

Gautier

Oct 22, 2006, 4:00:09 PM
Damien Carbonne:

> I've been following this thread for some times and I'm not a Fortran
> Guru,however, IIRC, Fortran arrays and Ada arrays don't have the same
> memory layout (Something like row major order and column major order).
> I've not compared programs of this thread with care, but I've the
> feeling that this could change things.
> Some years ago, I had to mix Ada and Fortran, and I remember that we had
> to invert loops order to obtain the same result.
> One would need to add 'pragma Convention (Fortran, Real_Matrix)' on Ada
> side to obtain the same result, or to exchange loops.
> My knowledge of Fortran was limited to Fortran 77, and I don't know if
> this is still valid with Fortran 95 or later.
> But if is still true, I would not consider the 2 programs as equivalent.

The question of Fortran's array storage has been treated earlier in this
discussion; you just happened to add more noise by not reading enough
posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it
shows the most serious testing so far, with the exact sources used for
testing.

Jeffrey R. Carter

Oct 22, 2006, 4:20:41 PM
tkrauss wrote:
>
> Anyway, using the latest code from Jeffrey Creem it looks like the
> execution time (on my machine) has been cut in half (9 seconds) . The
> threaded version runs in nearly the same time for smaller problems but
> dies with a stack overflow for larger. I see a comment in the Bugzilla
> recommending a similar construct
> type Real_Matrix is array (1 .. N, 1 .. N) of
> Float;
> That takes the memory from the stack rather than the heap though, no?
> I assume there is a compiler switch to increase the stack size so the
> code wouldn't die, but is that the "normal" way of allocating
> memory? I'm trying to not look like _too_ much of an Ada neophyte :)

With GNAT, at least, they're allocated on the stack. AFAICT, the
language doesn't require that, but it's a common approach. The main
stack size is probably an OS thing.
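
For the tasks in the threaded version, GNAT does let you request a larger
stack per task; a sketch (the 16 MB figure is an arbitrary illustration):

   task type Multiplier (Start_Row : Positive; Stop_Row : Positive) is
      pragma Storage_Size (16 * 1024 * 1024);
   end Multiplier;

The environment task's stack, on the other hand, is typically set by the
OS or the linker (e.g. "ulimit -s" on Linux).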

--
Jeff Carter
"I unclog my nose towards you."
Monty Python & the Holy Grail
11

Jeffrey R. Carter

Oct 22, 2006, 4:26:41 PM
Gautier wrote:
>
> The Sum variable was *removed* by someone at some point of the
> discussion in order to challenge a bit more the Ada compiler's optimizer.
> If you replace a good algorithm by a bad one, don't be surprised that
> the program is slow. At some point of bad coding the best code optimizer
> won't be able to help you.

I did that in my 1st version. I wanted to see if the optimizer would
result in equivalent code. No such luck.

> Eventually the optimizer will transform this:
> for R in A'range (2) loop
> C (I, J) := C (I, J) + A (I, R) * B (R, J);
> end loop;
> into something like:
> Sum:= C (I, J); -- hopefully a Sum is mapped to a register
> for R in A'range (2) loop
> Sum := Sum + A (I, R) * B (R, J);
> end loop;
> C (I, J):= Sum;
> but in that case it probably won't be able to guess that the C(I,J) was
> zeroed before and replace the first line by:
> Sum:= 0.0;

Thanks for the discussion.

The initialization of C is static, so a good optimizer could. They're
hard to find, though.

> sparing the reading of C(I,J) (must cost much time...).
> If you are luckier, the optimizer will do
> Sum:= 0.0;
> for R in A'range (2) loop
> Sum := Sum + A (I, R) * B (R, J);
> end loop;
> C (I, J):= C (I, J) + Sum;
> But still, it won't spare the time lost to fill the C matrix with zeros.

I didn't include that in the timing.

> If you want to do a benchmark with Fortran, it's really not a good idea
> to begin with "pessimizing" the Ada code.

I'm more interested in seeing what makes a difference in the Ada. In
this case, the high-level features that let you write less code.

Damien Carbonne

Oct 22, 2006, 4:51:01 PM
Gautier wrote:
> Damien Carbonne:
<snip>

>
>
> The question of Fortran's array storage has been treated earlier in this
> discussion, you just happened to add more noise by not reading enough
> posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it
> shows the most serious testing so far, with the exact sources used for
> testing.
Sorry for this. I'll go back to my silent, lazy listening after this
message :-) S. Tardieu had also suggested this layout issue in one
message, and I didn't follow the links given by M. Krischik. Anyway,
independently of compiler bugs or limitations, differences in matrix
layout lead to different (non-strictly-equivalent) programs! Which was
precisely a question asked by J. Creem in his first sentence!

Regards
Damien

Jeffrey R. Carter

unread,
Oct 22, 2006, 4:54:40 PM10/22/06
to
tmo...@acm.org wrote:
> Each 800x800 Float matrix is about 2.5 megabytes so I would expect two
> threads to be fighting each other over the cache. What are the single vs
> dual threaded times for, say, 250x250 arrays and a 1 MB cache on your
> machine, and on the dual Xeon machine?
> Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling
> with -O2 -gnatp using the venerable gnat 3.15p, I get
> Time: 8.511855190 800 1 thread
> Time: 5.682686332 800 2 threads
> for a speedup of 33%
> Time: 0.092724434 200 1 thread
> Time: 0.059209520 200 2 threads
> for a speedup of 36%

On my 3.2 GHz P4 HT (best of 3 runs), I get

Time: 0.170563050 250 1 thread
Time: 0.134637140 250 2 threads

for a speedup of 24%.

Time: 0.113739732 200 1 thread
Time: 0.070599932 200 2 threads

for a speedup of 36%.

I'm still only getting about 10% speedup for 800. Is it possible my
cache is smaller?

Gautier

unread,
Oct 22, 2006, 5:22:26 PM10/22/06
to
Jeffrey R. Carter:

>> But still, it won't spare the time lost to fill the C matrix with zeros.
>
> I didn't include that in the timing.

But you have to! Otherwise you are cheating. The zeroing is a part of your
changed algorithm; without zeroing the result is wrong.

>> If you want to do a benchmark with Fortran, it's really not a good
>> idea to begin with "pessimizing" the Ada code.
>
> I'm more interested in seeing what makes a difference in the Ada. In
> this case, the high-level features that let you write less code.

It is less code... "on paper". In fact, it is more work, or, at best, if
the code optimizer is really smart enough to combine the ":= (others =>
(others => 0.0) );" with the loops, it is the same work. Maybe that's the
problem with high-level features: one can zero a matrix in so few characters
that it seems to be done instantly ;-).

Cheers
Gautier

Alinabi

unread,
Oct 22, 2006, 8:02:13 PM10/22/06
to
Ok, it appears that the score is Fortran: 13 instructions -- Ada: 17
instructions in the inner loop.
Again, this is compiled with gcc 4.0.3 with the following switches:

-g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2
-fdump-tree-optimized


Fortran inner loop (13 instructions):
=============
3d4: 48 63 c1 movslq %ecx,%rax
3d7: ff c1 inc %ecx
3d9: 48 89 c2 mov %rax,%rdx
3dc: 49 0f af c2 imul %r10,%rax
3e0: 48 0f af d3 imul %rbx,%rdx
3e4: 4c 01 c8 add %r9,%rax
3e7: 48 01 f0 add %rsi,%rax
3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx
3ee: 44 39 c1 cmp %r8d,%ecx
3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0
3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0
3fd: f3 0f 58 c8 addss %xmm0,%xmm1
401: 75 d1 jne 3d4 <MAIN__+0x3d4>

Ada inner loop (17 instructions):
==============
a90: ff c6 inc %esi
a92: 48 63 d6 movslq %esi,%rdx
a95: 48 89 d0 mov %rdx,%rax
a98: 48 29 e8 sub %rbp,%rax
a9b: 4c 01 d8 add %r11,%rax
a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1
aa4: 4c 89 c0 mov %r8,%rax
aa7: 4c 29 d2 sub %r10,%rdx
aaa: 45 31 c9 xor %r9d,%r9d
aad: 48 83 c0 04 add $0x4,%rax
ab1: 49 0f 48 c1 cmovs %r9,%rax
ab5: 48 0f af d0 imul %rax,%rdx
ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0
abe: f3 0f 59 c1 mulss %xmm1,%xmm0
ac2: f3 0f 58 d0 addss %xmm0,%xmm2
ac6: 39 de cmp %ebx,%esi
ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0>

Now, since Jeffrey was right about me not being an assembly guru, here
is the assembly code for all three nested loops, so that you can
double-check it yourself.

Fortran:
========
do I = 1,N
2f2: 8b 84 24 ac 01 00 00 mov 0x1ac(%rsp),%eax
2f9: f3 0f 11 44 24 28 movss %xmm0,0x28(%rsp)
2ff: 85 c0 test %eax,%eax
301: 0f 8e 28 01 00 00 jle 42f <MAIN__+0x42f>
do J = 1,N
sum = 0.0
do R = 1,N
sum = sum + A(I,R)*B(R,J)
307: 48 8b 8c 24 28 01 00 mov 0x128(%rsp),%rcx
30e: 00
30f: 48 8b 94 24 18 01 00 mov 0x118(%rsp),%rdx
316: 00
317: 44 8d 40 01 lea 0x1(%rax),%r8d
31b: 4c 8b a4 24 10 01 00 mov 0x110(%rsp),%r12
322: 00
323: 48 8b 9c 24 40 01 00 mov 0x140(%rsp),%rbx
32a: 00
32b: 4c 8b 9c 24 60 01 00 mov 0x160(%rsp),%r11
332: 00
333: 4c 8b 94 24 78 01 00 mov 0x178(%rsp),%r10
33a: 00
33b: 48 89 4c 24 10 mov %rcx,0x10(%rsp)
340: 48 89 54 24 18 mov %rdx,0x18(%rsp)
345: 48 8b 8c 24 90 01 00 mov 0x190(%rsp),%rcx
34c: 00
end do
C(I,J) = sum
34d: 48 8b 94 24 c0 00 00 mov 0xc0(%rsp),%rdx
354: 00
355: 4c 8b 8c 24 68 01 00 mov 0x168(%rsp),%r9
35c: 00
35d: 4c 8b bc 24 f0 00 00 mov 0xf0(%rsp),%r15
364: 00
365: 0f 57 d2 xorps %xmm2,%xmm2
368: c7 44 24 30 01 00 00 movl $0x1,0x30(%rsp)
36f: 00
370: 48 89 4c 24 20 mov %rcx,0x20(%rsp)
375: 48 89 54 24 48 mov %rdx,0x48(%rsp)
37a: 48 8b 8c 24 d8 00 00 mov 0xd8(%rsp),%rcx
381: 00
382: 48 8b 94 24 c8 00 00 mov 0xc8(%rsp),%rdx
389: 00
38a: 48 89 4c 24 40 mov %rcx,0x40(%rsp)
38f: 48 89 54 24 38 mov %rdx,0x38(%rsp)
394: 48 63 44 24 30 movslq 0x30(%rsp),%rax
399: 48 8b 54 24 10 mov 0x10(%rsp),%rdx
39e: 41 bd 01 00 00 00 mov $0x1,%r13d
3a4: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx
3a9: 48 0f af d0 imul %rax,%rdx
3ad: 48 0f af 44 24 40 imul 0x40(%rsp),%rax
3b3: 48 8d 3c 0a lea (%rdx,%rcx,1),%rdi
3b7: 48 8b 54 24 38 mov 0x38(%rsp),%rdx
3bc: 4c 8d 34 10 lea (%rax,%rdx,1),%r14
3c0: 48 8b 74 24 20 mov 0x20(%rsp),%rsi
3c5: 49 63 ed movslq %r13d,%rbp
3c8: b9 01 00 00 00 mov $0x1,%ecx
3cd: 0f 28 ca movaps %xmm2,%xmm1
3d0: 48 0f af f5 imul %rbp,%rsi
3d4: 48 63 c1 movslq %ecx,%rax
3d7: ff c1 inc %ecx
3d9: 48 89 c2 mov %rax,%rdx
3dc: 49 0f af c2 imul %r10,%rax
3e0: 48 0f af d3 imul %rbx,%rdx
3e4: 4c 01 c8 add %r9,%rax
3e7: 48 01 f0 add %rsi,%rax
3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx
3ee: 44 39 c1 cmp %r8d,%ecx
3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0
3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0
3fd: f3 0f 58 c8 addss %xmm0,%xmm1
401: 75 d1 jne 3d4 <MAIN__+0x3d4>
403: 4c 89 f8 mov %r15,%rax
406: 48 8b 54 24 48 mov 0x48(%rsp),%rdx
40b: 41 ff c5 inc %r13d
40e: 48 0f af c5 imul %rbp,%rax
412: 41 39 cd cmp %ecx,%r13d
415: 49 8d 04 06 lea (%r14,%rax,1),%rax
419: f3 0f 11 0c 82 movss %xmm1,(%rdx,%rax,4)
41e: 75 a0 jne 3c0 <MAIN__+0x3c0>
420: ff 44 24 30 incl 0x30(%rsp)
424: 44 39 6c 24 30 cmp %r13d,0x30(%rsp)
429: 0f 85 65 ff ff ff jne 394 <MAIN__+0x394>
end do
end do


Ada:
=============
for I in A'range(1) loop
9a5: 41 8b 45 00 mov 0x0(%r13),%eax
9a9: 41 8b 55 04 mov 0x4(%r13),%edx
9ad: 39 d0 cmp %edx,%eax
9af: 0f 8f 8a 01 00 00 jg b3f <_ada_tst_array+0x44f>
for J in A'range(2) loop
Sum := 0.0;
for R in A'range(2) loop


Sum := Sum + A(I,R)*B(R,J);

9b5: 4c 8b bc 24 d0 00 00 mov 0xd0(%rsp),%r15
9bc: 00
end loop;
C(I,J) := Sum;
9bd: 48 8b 8c 24 c0 00 00 mov 0xc0(%rsp),%rcx
9c4: 00
9c5: 4c 8b a4 24 e0 00 00 mov 0xe0(%rsp),%r12
9cc: 00
9cd: 89 44 24 68 mov %eax,0x68(%rsp)
9d1: 4c 89 7c 24 50 mov %r15,0x50(%rsp)
9d6: 48 89 8c 24 88 00 00 mov %rcx,0x88(%rsp)
9dd: 00
9de: 45 8b 75 08 mov 0x8(%r13),%r14d
9e2: 45 8b 6d 0c mov 0xc(%r13),%r13d
9e6: 44 89 6c 24 6c mov %r13d,0x6c(%rsp)
9eb: 4c 63 7c 24 6c movslq 0x6c(%rsp),%r15
9f0: 48 63 f8 movslq %eax,%rdi
9f3: 0f 57 e4 xorps %xmm4,%xmm4
9f6: 48 89 7c 24 20 mov %rdi,0x20(%rsp)
9fb: 49 63 ee movslq %r14d,%rbp
9fe: 89 54 24 2c mov %edx,0x2c(%rsp)
a02: 44 89 eb mov %r13d,%ebx
a05: 4c 89 7c 24 30 mov %r15,0x30(%rsp)
a0a: 44 3b 74 24 6c cmp 0x6c(%rsp),%r14d
a0f: 0f 8f 17 01 00 00 jg b2c <_ada_tst_array+0x43c>
a15: 48 8b 44 24 30 mov 0x30(%rsp),%rax
a1a: ba 00 00 00 00 mov $0x0,%edx
a1f: 45 89 f5 mov %r14d,%r13d
a22: 0f 28 dc movaps %xmm4,%xmm3
a25: 48 8b 4c 24 40 mov 0x40(%rsp),%rcx
a2a: 48 29 e8 sub %rbp,%rax
a2d: 48 8d 04 85 04 00 00 lea 0x4(,%rax,4),%rax
a34: 00
a35: 48 85 c0 test %rax,%rax
a38: 48 0f 48 c2 cmovs %rdx,%rax
a3c: 48 63 54 24 68 movslq 0x68(%rsp),%rdx
a41: 48 c1 f8 02 sar $0x2,%rax
a45: 48 89 54 24 70 mov %rdx,0x70(%rsp)
a4a: 48 2b 54 24 20 sub 0x20(%rsp),%rdx
a4f: 49 89 d3 mov %rdx,%r11
a52: 4c 0f af d8 imul %rax,%r11
a56: 8b 01 mov (%rcx),%eax
a58: 4c 63 d0 movslq %eax,%r10
a5b: 8b 41 0c mov 0xc(%rcx),%eax
a5e: 48 98 cltq
a60: 8b 51 08 mov 0x8(%rcx),%edx
a63: 48 63 d2 movslq %edx,%rdx
a66: 48 29 d0 sub %rdx,%rax
a69: 48 89 14 24 mov %rdx,(%rsp)
a6d: 4c 8d 04 85 00 00 00 lea 0x0(,%rax,4),%r8
a74: 00
a75: 49 63 cd movslq %r13d,%rcx
a78: 4c 8b 7c 24 50 mov 0x50(%rsp),%r15
a7d: 44 89 f6 mov %r14d,%esi
a80: 48 89 c8 mov %rcx,%rax
a83: 48 2b 04 24 sub (%rsp),%rax
a87: 0f 28 d3 movaps %xmm3,%xmm2
a8a: 49 8d 3c 87 lea (%r15,%rax,4),%rdi
a8e: eb 02 jmp a92 <_ada_tst_array+0x3a2>
a90: ff c6 inc %esi
a92: 48 63 d6 movslq %esi,%rdx
a95: 48 89 d0 mov %rdx,%rax
a98: 48 29 e8 sub %rbp,%rax
a9b: 4c 01 d8 add %r11,%rax
a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1
aa4: 4c 89 c0 mov %r8,%rax
aa7: 4c 29 d2 sub %r10,%rdx
aaa: 45 31 c9 xor %r9d,%r9d
aad: 48 83 c0 04 add $0x4,%rax
ab1: 49 0f 48 c1 cmovs %r9,%rax
ab5: 48 0f af d0 imul %rax,%rdx
ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0
abe: f3 0f 59 c1 mulss %xmm1,%xmm0
ac2: f3 0f 58 d0 addss %xmm0,%xmm2
ac6: 39 de cmp %ebx,%esi
ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0>
aca: 48 8b 54 24 48 mov 0x48(%rsp),%rdx
acf: 8b 02 mov (%rdx),%eax
ad1: 48 63 d0 movslq %eax,%rdx
ad4: 48 8b 7c 24 48 mov 0x48(%rsp),%rdi
ad9: 8b 47 0c mov 0xc(%rdi),%eax
adc: 48 63 f8 movslq %eax,%rdi
adf: 4c 8b 7c 24 48 mov 0x48(%rsp),%r15
ae4: 41 8b 47 08 mov 0x8(%r15),%eax
ae8: 48 98 cltq
aea: 4c 8b 7c 24 70 mov 0x70(%rsp),%r15
aef: 48 29 c7 sub %rax,%rdi
af2: 48 29 c1 sub %rax,%rcx
af5: 48 8d 04 bd 04 00 00 lea 0x4(,%rdi,4),%rax
afc: 00
afd: 49 29 d7 sub %rdx,%r15
b00: 48 85 c0 test %rax,%rax
b03: 4c 89 fa mov %r15,%rdx
b06: 49 0f 48 c1 cmovs %r9,%rax
b0a: 48 0f af d0 imul %rax,%rdx
b0e: 48 8b 84 24 88 00 00 mov 0x88(%rsp),%rax
b15: 00
b16: 48 8d 0c 88 lea (%rax,%rcx,4),%rcx
b1a: f3 0f 11 14 11 movss %xmm2,(%rcx,%rdx,1)
b1f: 41 39 f5 cmp %esi,%r13d
b22: 74 08 je b2c <_ada_tst_array+0x43c>
b24: 41 ff c5 inc %r13d
b27: e9 49 ff ff ff jmpq a75 <_ada_tst_array+0x385>
b2c: 8b 54 24 2c mov 0x2c(%rsp),%edx
b30: 39 54 24 68 cmp %edx,0x68(%rsp)
b34: 74 09 je b3f <_ada_tst_array+0x44f>
b36: ff 44 24 68 incl 0x68(%rsp)
b3a: e9 cb fe ff ff jmpq a0a <_ada_tst_array+0x31a>
end loop;
end loop;

Georg Bauhaus

unread,
Oct 22, 2006, 8:10:53 PM10/22/06
to
On Sun, 2006-10-22 at 14:24 -0400, Jeffrey Creem wrote:

> It sounds like you are running some different code and I'd be hesitant
> to make any assertions about runtime for certain constructs without
> seeing it since

It is appended below.

> 2) You mention something about just accessing first, middle and last of
> your arrays so it really sounds like you really are just timing
> allocations and not actually really hitting the array indexing
> (though hard to say without seeing the code).

Yes, I had concentrated on just this question, asked by Thomas Krauss,
about stack versus heap allocation. If a program is not frequently
allocating and freeing arrays (or matrices, etc.), the effects
might be less of an issue, if they are an issue at all when there
is no initialization.
But isn't allocation just what happens all the time if you have a
function that computes matrices of varying sizes?
OTOH, I find very different results on GNU/Linux (for which I only
have a number of GNATs). Stack allocation appears to be consuming
next to no time. (Commit on write? Just remembering an offset?)

Below the following program, there is a patch that adds trivial
initialization loops. With them, a comparison of stack vs heap
on GNU/Linux seems in favor of the stack. I can't check this on
Windows right now, though.

with Ada.Calendar;
with Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure main is
   use Ada.Calendar, Ada;

   type LIST is array (NATURAL range <>) of BOOLEAN;
   for LIST'component_size use 8;
   -- for a "normal" array, not packed or other magic

   type LIST_PTR is access LIST;
   procedure free is new Ada.Unchecked_Deallocation (LIST, LIST_PTR);

   accu: BOOLEAN := false;
   -- The allocating functions read and write this variable
   -- using components of the local arrays.
   -- (This should prevent some optimizations.)

   function allocate(size: POSITIVE) return BOOLEAN;
   -- use a local `LIST` of length `size`

   function allocate_heap(size: POSITIVE) return BOOLEAN;
   -- use a pointer to a new `LIST` of length `size`

   function image(t: TIME) return STRING;
   -- the current time as a `STRING` value

   function allocate(size: POSITIVE) return BOOLEAN is
      done: LIST(1 .. size); -- pragma volatile(done);
      result: BOOLEAN;
   begin
      if done(size / 2) then
         result := false;
      end if;
      done(done'last) := accu and done(done'first);
      result := done(done'last) and done(done'first);
      return result;
   end allocate;

   function allocate_heap(size: POSITIVE) return BOOLEAN is
      done: LIST_PTR;
      result: BOOLEAN;
   begin
      done := new LIST(1 .. size);
      if done(size / 2) then
         result := false;
      end if;
      done(done'last) := accu and done(done'first);
      result := done(done'last) and done(done'first);
      -- (the original read done(done'first) twice here, presumably a typo)
      Free(done);
      return result;
   end allocate_heap;

   function image(t: TIME) return STRING is
      year: YEAR_NUMBER;
      day: DAY_NUMBER;
      month: MONTH_NUMBER;
      sec: DURATION;
   begin
      split(t, year, month, day, sec);
      return YEAR_NUMBER'image(year)
        & MONTH_NUMBER'image(month)
        & DAY_NUMBER'image(day)
        & DURATION'image(sec);
   end image;

   start, finish: TIME;

begin
   Text_IO.put_line("using a stack");
   start := clock;
   for run in 1 .. 25 loop
      for k in 1 .. 2000 loop
         accu := allocate(5000 * k);
      end loop;
   end loop;
   finish := clock;

   Text_IO.put_line("from" & image(start)
                    & " to" & image(finish)
                    & " = " & DURATION'image(finish - start));
   Text_IO.put_line("accu " & BOOLEAN'image(accu));

   Text_IO.put_line("using a heap");
   start := clock;
   for run in 1 .. 25 loop
      for k in 1 .. 2000 loop
         accu := allocate_heap(5000 * k);
      end loop;
   end loop;
   finish := clock;

   Text_IO.put_line("from" & image(start)
                    & " to" & image(finish)
                    & " = " & DURATION'image(finish - start));
   Text_IO.put_line("accu " & BOOLEAN'image(accu));
end main;

--- stack_use_testing.ada 2006/10/22 23:51:48 1.1
+++ stack_use_testing.ada 2006/10/22 23:52:12
@@ -33,2 +33,3 @@
begin
+ for k in done'range loop done(k) := false; end loop;
if done(size / 2) then
@@ -46,2 +47,3 @@
done := new LIST(1 .. size);
+ for k in done'range loop done(k) := false; end loop;
if done(size / 2) then
@@ -75,4 +77,4 @@
start := clock;
- for run in 1 .. 25 loop
- for k in 1 .. 2000 loop
+ for run in 1 .. 2 loop
+ for k in 1 .. 200 loop
accu := allocate(5000 * k);
@@ -90,4 +92,4 @@
start := clock;
- for run in 1 .. 25 loop
- for k in 1 .. 2000 loop
+ for run in 1 .. 2 loop
+ for k in 1 .. 200 loop
accu := allocate_heap(5000 * k);


Jeffrey Creem

unread,
Oct 22, 2006, 9:31:09 PM10/22/06
to

Yes, there is always the chance of row-major/column-major issues
creeping in and causing dramatically different cache results; however,
the original poster seemed to have tried the swap to no effect, and the
assembly code appears to show that this was not the problem.

Jeffrey Creem

unread,
Oct 22, 2006, 10:15:10 PM10/22/06
to

Actually, feel free to chime in even if you are not sure you are adding
value. Heck, if we all waited until we were adding value, we'd never
post anything :)

Part of the problem here was my poor wording of the question.

What I was asking is whether the two programs were semantically
equivalent, without dropping down to specific issues of storage
representation.

Specifically, one of the biggest speedups posted (and suggested by one
person on the bugzilla list) changes the Ada program so it is no longer
a pointer to an unconstrained array but rather a pointer to a
constrained array where the constrained array type definition is
determined at run time based on the size of the command line parameter.

This may seem like the same thing, but where I assert it falls apart
is if someone wanted to write code that operated on variable-sized
arrays: this approach (dynamic creation of a fixed-length array) would
probably not work.

It appears to me that the FORTRAN version is using as dynamic an array
size as possible and further that it is possible to pass these
dynamically sized arrays to other procedures (for example matmul). Now,
matmul may be some sort of special case because it appears to be a
compiler intrinsic on most compilers. I've poked around the web
resources a bit for FORTRAN 90/95 tutorials but I have not found the
specific examples that would convince me I know what I am talking about.

I guess what I am asking (or perhaps asserting) is that the original
version of the Ada program is the closest to the FORTRAN version and not
the one that creates the fixed-length array, since it appears that
allocatable arrays of varying sizes can be passed to Fortran
subroutines. The original version of the Ada maintained that property.
The modified version (and its derivatives that set the size in the type
rather than in the new) do not appear to me to be equivalent programs.

The reason this is important is that in general I don't think we are
trying to solve the "write the fastest program that can calculate a
matrix multiply" but rather investigate the way in which fairly simple
programming idioms are translated into assembly language and understand
their runtime implications (at least that is what my goal has been in this).

Jeffrey R. Carter

unread,
Oct 22, 2006, 10:29:13 PM10/22/06
to
Jeffrey Creem wrote:
>
> Specifically, one of the biggest speedups posted (and suggested by one
> person on the bugzilla list) changes the Ada program so it is no longer
> a pointer to an unconstrained array but rather a pointer to a
> constrained array where the constrained array type definition is
> determined at run time based on the size of the command line parameter.
>
> This may seem like the same thing but where I am asserting this falls
> apart is if someone wanted to write code that operated on variable sized
> arrays, this approach would probably not work (i.e. dynamic creation of
> a fixed length array).

You should be able to get the best of both worlds:

N : constant Positive := Integer'Value (Argument (1));

type Base_Matrix is array (Positive range <>, Positive range <>) of Float;

subtype Real_Matrix is Base_Matrix (1 .. N, 1 .. N);
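
A hypothetical way to combine the two (the access type and the Multiply
profile are illustrative, not from the thread):

   type Matrix_Access is access Base_Matrix;

   A : Matrix_Access := new Real_Matrix; -- bounds fixed at run time

   procedure Multiply (A, B : in Base_Matrix; C : out Base_Matrix);
   -- still accepts matrices of any size and bounds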

Jeffrey Creem

unread,
Oct 22, 2006, 11:10:03 PM10/22/06
to
Ok, a lot of suggestions have been posted, including a few that I have
not addressed or even investigated (specifically, the post that talked
about the assembly-language output from gcc 4.0.3, which makes this
appear, to me, to be a regression).

Having said that, I did update the wiki site with some timing numbers
people asked for from a dual processor machine.

http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison

Scroll to the bottom and start reading at "Other Versions" if you are
interested. This is getting too messy for plain text conversation.

Gautier

unread,
Oct 23, 2006, 1:28:33 AM10/23/06
to
Alinabi:

> -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2
> -fdump-tree-optimized

Did you give -funroll-loops a try?

G.

Martin Krischik

unread,
Oct 23, 2006, 2:28:15 AM10/23/06
to
Gautier schrieb:
> Jeffrey Creem:
>
>> Note, I am the first one to jump to the defense of "Ada" in general
>> but in this case, GNAT just plain sucks compared to GNU FORTRAN as it
>> does a poor job on (at least) the inner loop (verified by looking at
>> the output assembly)
>
> There is something strange... Martin Krischik was able to trim the
> overall time for the Ada code down to 24% of the first version (GNAT/GCC
> 4.1.1).
> This should make the Ada program as fast as the FORTRAN one, shouldn't it ?
> Maybe it's because the test is done on a 64 bit machine ?

A 64-bit system doesn't need -march, -mcpu, and -msse2 as desperately. Try

>gcc -dumpmachine
i686-pc-cygwin

For this system, i686 is the default.

Martin

Jeffrey R. Carter

unread,
Oct 23, 2006, 3:31:55 AM10/23/06
to
Jeffrey Creem wrote:
>
> Having said that, I did update the wiki site with some timing numbers
> people asked for from a dual processor machine.
>
> http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison
>
> Scroll to the bottom and start reading at "Other Versions" if you are
> interested. This is getting too messy for plain text conversation.

Wow, a greater than 50% speedup from using 2 tasks instead of 1 (from
ada/tst_array to ada/tst_array_task_2). Something is odd there.

This shows that, for those with dual-processor machines, easy-to-create
Ada is faster than easy-to-create FORTRAN.

This doesn't help those with single processors, of course.

I doubt if anything will beat matmul (short of 640_000 processors). But
Ada with explicit loops and FORTRAN with matmul are hardly equivalent.
You need to convention-FORTRAN the Ada array type, import matmul, and
call it to get a fair comparison. There shouldn't be much difference.

Jeffrey Creem

unread,
Oct 23, 2006, 7:55:49 AM10/23/06
to

I looked at the gcc FORTRAN matmul. Unless there is some additional
trickery going on behind the scenes, it is not anything magical. It
looks like a matmul implemented in C with manual array-subscripting
logic (i.e. it uses a single-dimensional array overlay).
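
Roughly that idiom, as a sketch (the names and the 0-based flat
indexing are mine):

   -- N x N matrices flattened row-major into one-dimensional arrays
   type Flat is array (0 .. N * N - 1) of Float;
   Af, Bf, Cf : Flat;
   Sum : Float;
   -- ... then, in the statement part:
   for I in 0 .. N - 1 loop
      for J in 0 .. N - 1 loop
         Sum := 0.0;
         for R in 0 .. N - 1 loop
            Sum := Sum + Af (I * N + R) * Bf (R * N + J);
         end loop;
         Cf (I * N + J) := Sum;
      end loop;
   end loop;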

In any case, it is not so much matmul I am trying to make faster here
but rather just the nature of 2d array traversals in native language
structures.

I just included the FORTRAN matmul to be "more than fair" to FORTRAN as
I am in no way trying to bash FORTRAN.

The speedup for the tasks is quite odd though. I'll need to disassemble
it tonight.

I also just finished a 4.0.2 install last night so I'll get those
numbers to see if all of this mess is simply a regression someplace in
the compiler.

Warner BRUNS

unread,
Oct 23, 2006, 8:33:37 AM10/23/06
to

A problem with Ada is that the memory layout of a two-dimensional
array is not specified in the Ada Standard.
If you want a uniform memory layout across different Ada compilers,
you have to specify that the layout should be what Fortran would use.
Very odd.

TYPE DefReal IS DIGITS 12;
TYPE DefReal2D IS ARRAY (DefInt RANGE <>, DefInt RANGE <>) OF DefReal;
PRAGMA CONVENTION (Fortran, DefReal2D);

Warner

Warner BRUNS

unread,
Oct 23, 2006, 8:40:56 AM10/23/06
to
A problem with Ada is that the memory layout of multidimensional arrays
is not specified in the standard. If you want to have the same memory
layout for all Ada compilers, you have to specify that a multidimensional
array should follow the convention of a foreign language, e.g. Fortran.

TYPE DefReal IS DIGITS 12;
TYPE DefReal2D IS ARRAY (Integer RANGE <>, Integer RANGE <>) OF DefReal;
PRAGMA CONVENTION (Fortran, DefReal2D);

There is something missing in the Ada standard.

Warner

Georg Bauhaus

unread,
Oct 23, 2006, 9:52:00 AM10/23/06
to
On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote:


> A problem with Ada is that the memory layout of multidimensional arrays
> is not specified in the standard. If you want to have the same memory
> layout for all Ada compilers, you have to specify that a multidimensional
> array should follow the convention of a foreign language, e.g. Fortran.

Will a memory layout cast in source really help with, e.g., all
Ada x Fortran combinations of compilers and libraries?
It might prevent "knowledgeable" compilers from choosing a layout
that suits the current setup, I'd think, and it requires touching
the sources.


-- Georg


Grein, Christoph (Fa. ESG)

unread,
Oct 23, 2006, 8:46:55 AM10/23/06
to Warner BRUNS, comp.l...@ada-france.org
Warner BRUNS claimed:

> A problem with Ada is that the memory layout of a two-dimensional
> array is not specified in the Ada Standard.

That's wrong:
Ada is row-first order (1, 1) (1, 2) etc.,
Fortran specifies column-first (1, 1) (2, 1) etc.

This is what the pragma Convention is for.

Robert A Duff

unread,
Oct 23, 2006, 11:02:10 AM10/23/06
to
Warner BRUNS <Warner...@cern.ch> writes:

> A problem with Ada is that the memory layout of multidimensional arrays
> is not specified in the standard. If you want to have the same memory
> layout for all Ada compilers, you have to specify that a multimensional
> array should follow the convention of a foreign language, eg. Fortran.
> TYPE DefReal IS DIGITS 12;
> TYPE DefReal2D IS ARRAY(Integer RANGE <>,
> Integer RANGE <>) OF DefReal;
> PRAGMA CONVENTION (Fortran, DefReal2D);

In practice, you can count on the fact that all Ada compilers will
choose row-major order, unless you ask for column-major via the above
pragma, even though the RM does not say so.

> There is something missing in the Ada standard.

Perhaps.

- Bob

Alinabi

unread,
Oct 23, 2006, 12:32:24 PM10/23/06
to
I did, and it made no difference. I think no loop unrolling is done in
the part of the code which is being timed, as the number of iterations
is not known at compile time.

Warner BRUNS

unread,
Oct 23, 2006, 1:06:50 PM10/23/06
to
I might be wrong, but my understanding is that the memory layout is not specified
in the standard. Can you point me to the section of the standard which specifies
the memory layout of a multidimensional array?
To me it seems as if you can only count on a specific memory layout if you
specify it to be the layout of a non-Ada language, e.g. Fortran.
Strange indeed for a standard.

Warner

Warner BRUNS

unread,
Oct 23, 2006, 1:11:51 PM10/23/06
to
It will help with all Ada/Fortran compiler combinations,
but the really strange thing is that the memory layout is not
definitely known from the Ada source alone unless it is specified
as the layout required by a different language, e.g. Fortran.
If one wants to code a cache-aware algorithm for, e.g., multiplying
matrices, one cannot do this in Ada without enforcing the memory
layout of the matrices via PRAGMA CONVENTION.
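
To illustrate (a sketch; Touch stands for whatever is done with each
element):

   -- Ada default (row-major): let the second index vary fastest
   for I in M'Range (1) loop
      for J in M'Range (2) loop
         Touch (M (I, J));
      end loop;
   end loop;

   -- with PRAGMA CONVENTION (Fortran, ...) (column-major):
   -- let the first index vary fastest
   for J in M'Range (2) loop
      for I in M'Range (1) loop
         Touch (M (I, J));
      end loop;
   end loop;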

Warner

tmo...@acm.org

unread,
Oct 23, 2006, 1:42:42 PM10/23/06
to
>I might be wrong, but my understanding is that the memory layout is not specified
>in the standard. Can you point me to the section of the standard which specifies
>the memory layout of a multidimensional array?
ARM 3.6.2(11) "An implementation should normally represent a
multidimensional array in row-major order, consistent with the notation
used for multidimensional array aggregates (see 4.3.3). However, if a
Pragma Convention(Fortran, ...) applies to a multidimensional array type,
then column-major order should be used instead ..."
It would be strange to include powerful rep spec capabilities to define
everything about layouts, but not how an array is stored.

Dr. Adrian Wrigley

unread,
Oct 23, 2006, 1:57:01 PM10/23/06
to

I agree that it seems like an omission from the Ada standard.
This specification should be available using a representation
clause. Telling the compiler to allocate it like Fortran
would is the wrong way, logically.

Of course, if it really mattered, you could use an array of arrays
(like in C), or an array of array accesses. With appropriate
use of pragma representation you have a very high degree of flexibility.

Compare this to almost any other language, and you find Ada leads
the field by a mile. What if you wanted an alternative layout
in Fortran? Or C++?

Sometimes I'd like to see some intermingling of row/column ordering,
so a page in memory holds a square sub-array of a 2-D array.
The approach I'll take (if I get around to it!) is to use a
2-D array of accesses, allocated (by a pool allocator) with the
desired pattern. Moral of the story: One Size does not Fit All...
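
A sketch of that shape (the names are mine):

   type Row is array (Positive range <>) of Float;
   type Row_Ptr is access Row; -- each row placed wherever a pool decides
   type Row_Matrix is array (Positive range <>) of Row_Ptr;

   M : Row_Matrix (1 .. N);
   -- allocate M (I) := new Row (1 .. N) from the pool;
   -- element access is then M (I)(J)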
--
Adrian

Robert A Duff

unread,
Oct 23, 2006, 3:14:33 PM10/23/06
to
tmo...@acm.org writes:

> ARM 3.6.2(11) "An implementation should normally represent a
> multidimensional array in row-major order, ...

Ah, I had forgotten that. Thanks.

When the RM says "should", you can usually count on it being true in all
serious Ada compilers. So why doesn't it say "shall"? Well, it's not
really a formal requirement -- you can't write a test case that can
tell for sure which way arrays are stored. Nonetheless, I think you
can count on this rule being obeyed -- compiler writers don't
deliberately go out of their way to cause trouble for their customers.

I certainly do not think you should start putting Convention(Fortran) on
all your performance-critical arrays, just to make sure!

- Bob

Wiljan Derks

unread,
Oct 23, 2006, 3:52:59 PM10/23/06
to
> Jeffrey R. Carter wrote:
> I looked at the gcc FORTRAN matmul. Unless there some additional trickery
> going on behind the scenes, it is not anything magical. It looks like a
> matmul implemented in C with manual array subscripting logic (i.e. uses a
> single dimensional array overlay)..
When going for a faster algorithm this might be useful information.
Notice that this matmul uses an O(n^3) algorithm, and it is possible to
do this faster.
I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster.
It is easy to see from the loops that the basic multiplications are
done n times for the same pairs of numbers.
In a similar way, the same additions are done multiple times.
That means that there is real potential to improve things.
For all small matrices (say, 2 up to 4) one can write out all the logic
to multiply them and then optimize the code for those specific
multiplications. This will give us fast multiplications for small
matrices.
When having a larger matrix, it can be chopped into 4 smaller ones, and
these pieces can then be multiplied and added together.
In total this can lead to a faster algorithm, although much more complex
than this multiplication.


Jeffrey R. Carter

unread,
Oct 23, 2006, 4:22:38 PM10/23/06
to
Robert A Duff wrote:
>
> In practise, you can count on the fact that all Ada compilers will
> choose row-major order, unless you ask for column-major via the above
> pragma, even though the RM does not say so.

Because of the implementation advice in ARM 3.6.2: "An implementation
should normally represent multidimensional arrays in row-major order,
consistent with the notation used for multidimensional array aggregates
(see 4.3.3). However, if a pragma Convention(Fortran, ...) applies to a
multidimensional array type, then column-major order should be used
instead (see B.5, ``Interfacing with Fortran'')."

--
Jeff Carter
"Blessed are they who convert their neighbors'
oxen, for they shall inhibit their girth."
Monty Python's Life of Brian
83

Jeffrey R. Carter

unread,
Oct 23, 2006, 4:25:21 PM10/23/06
to
Wiljan Derks wrote:
>> Jeffrey R. Carter wrote:
>> I looked at the gcc FORTRAN matmul. Unless there some additional trickery
>> going on behind the scenes, it is not anything magical. It looks like a
>> matmul implemented in C with manual array subscripting logic (i.e. uses a
>> single dimensional array overlay)..

For the record, this was Jeffrey Creem, not me.

Jeffrey R. Carter

unread,
Oct 23, 2006, 4:29:09 PM10/23/06
to
Robert A Duff wrote:
>
> I certainly do not think you should start putting Convention(Fortran) on
> all your performance-critical arrays, just to make sure!

In the case of records, the representation is up to the compiler unless
there is a representation clause. I don't find it surprising that the
same is true of arrays. However, as with records, there should be some
way to specify the representation for arrays other than a pragma
Convention for another language.

Dr. Adrian Wrigley

unread,
Oct 24, 2006, 5:52:51 AM10/24/06
to
On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote:

>> Jeffrey R. Carter wrote:
>> I looked at the gcc FORTRAN matmul. Unless there some additional trickery
>> going on behind the scenes, it is not anything magical. It looks like a
>> matmul implemented in C with manual array subscripting logic (i.e. uses a
>> single dimensional array overlay)..
> When going for a faster algorithm this might be usefull information.
> Notice that this matmul uses an O(n^3) algorithm and it is possible to do
> this faster.
> I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster.
> It is easy to see that from the loops that the basic multiplications are
> done n times for the same pairs of numbers.

I know you're quite correct that there is a faster algorithm for
large n. But looking at the loops, I don't see where the same
number pairs are multiplied at different points in the calculation.
Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else?

> In a similar way, the same additions are done multiple times.
> That means that there is real potential to improve things.

Aren't they only the same additions if the same products are summed?

I may be being particularly stupid today, but if the redundancy
is obvious from the loops, can you help me see where! Thanks.
--
Adrian

Jeffrey Creem

unread,
Oct 24, 2006, 7:50:36 AM10/24/06
to

I have not looked at algorithms in this area for a few years, but I do
recall that you need very, very large N before any of them start being
worthwhile.

There are some possible speedups that stay O(N**3) but optimize fetches
and stores for cache performance and hand-vectorize portions of the
inner loop.
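
As a sketch of the cache-blocking idea (the tile size is arbitrary, and
it assumes 1-based square matrices whose side is a multiple of Blk):

   procedure Matmul_Blocked (A, B : in Real_Matrix; C : out Real_Matrix) is
      Blk : constant := 50; -- tile size; the best value is machine-dependent
      N : constant Natural := A'Length (1);
   begin
      for I in C'Range (1) loop
         for J in C'Range (2) loop
            C (I, J) := 0.0;
         end loop;
      end loop;
      for I0 in 0 .. N / Blk - 1 loop
         for J0 in 0 .. N / Blk - 1 loop
            for R0 in 0 .. N / Blk - 1 loop
               for I in I0 * Blk + 1 .. (I0 + 1) * Blk loop
                  for J in J0 * Blk + 1 .. (J0 + 1) * Blk loop
                     for R in R0 * Blk + 1 .. (R0 + 1) * Blk loop
                        C (I, J) := C (I, J) + A (I, R) * B (R, J);
                     end loop;
                  end loop;
               end loop;
            end loop;
         end loop;
      end loop;
   end Matmul_Blocked;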

I would guess that even the library-level matmul (which still has the
best overall time) could be made about 2x faster on a recent Intel
platform with those changes.


Back to the original discussion, there are quite a few interesting
things going on.

1) I have confirmed what another poster wrote that 4.2.0 is
substantially slower than 4.0.3 on the original Ada example (thus, this
performance is a regression)
2) Small changes to the original code that still meet the original
structure and intent of the original code can move the run time up and
down by at least 50%
3) The two-task version of the code is running more than 2x faster
than the single-task version on a 2-processor machine. Some of this is
from the two tasks, but looking at the assembly, another portion of it
is related to #2 above, in that the rearrangement of the math allows
the compiler to get less brain-dead. (A minimal two-task split is
sketched below.)


If I get a chance tonight, I'll post some updated results and analysis
of the generated code.
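
For reference, here is roughly what the two-task split looks like; this
is a minimal sketch, not the wiki code, and it assumes A, B, and C are
matrices visible at library level with an even number of rows:

   task type Worker (First, Last : Integer);

   task body Worker is
      Sum : Float;
   begin
      for I in First .. Last loop
         for J in B'Range (2) loop
            Sum := 0.0;
            for R in A'Range (2) loop
               Sum := Sum + A (I, R) * B (R, J);
            end loop;
            C (I, J) := Sum;
         end loop;
      end loop;
   end Worker;

and, at the point of use:

   declare
      Mid : constant Integer := (A'First (1) + A'Last (1)) / 2;
      W1 : Worker (A'First (1), Mid);
      W2 : Worker (Mid + 1, A'Last (1));
   begin
      null; -- the block does not exit until both workers terminate
   end;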

Jeffrey R. Carter

unread,
Oct 24, 2006, 12:24:25 PM10/24/06
to
Jeffrey Creem wrote:
>
> 2) Small changes to the original code that still meet the original
> structure and intent of the original code can move the run time up and
> down by at least 50%
> 3) The two task based version of the code is running more than 2x faster
> than the single task version on a 2 processor machine. Some of this is
> from the two tasks but looking at the assembly, another portion of it is
> related to #2 above in that the re-arrangement of the math allows the
> compiler to get less brain dead.

These seem quite odd to me. Perhaps whatever is causing this is also the
cause of the speedup I saw in the sequential case when the
multiplication is put in a procedure.

--
Jeff Carter
"You've got the brain of a four-year-old boy,
and I bet he was glad to get rid of it."
Horse Feathers
47

Wiljan Derks

unread,
Oct 24, 2006, 3:21:13 PM10/24/06
to

"Dr. Adrian Wrigley" <am...@linuxchip.demon.co.uk.uk.uk> wrote in message
news:pan.2006.10.24....@linuxchip.demon.co.uk.uk.uk...

> I know you're quite correct that there is a faster algorithm for
> large n. But looking at the loops, I don't see where the same
> number pairs are multiplied at different points in the calculation.
> Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else?
>
> Aren't they only the same additions if the same products are summed?
>
Sorry, I was too fast writing this remark.
Since I was wrong, I checked my old school papers and found how to
perform a faster matrix multiplication.
It is called the Strassen algorithm.
For more details see: http://en.wikipedia.org/wiki/Strassen_algorithm
It still looks to me as if there may be considerable savings, since this
algorithm is O(n^2.8) instead of O(n^3).
Wikipedia also mentions the drawback: numeric instability!
So the subtraction and addition tricks probably lead to less accurate
results in certain cases.
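
For reference, the seven products in Strassen's 2x2 block scheme (the
standard textbook form, not from the thread) are:

M1 = (A11 + A22) * (B11 + B22)
M2 = (A21 + A22) * B11
M3 = A11 * (B12 - B22)
M4 = A22 * (B21 - B11)
M5 = (A11 + A12) * B22
M6 = (A21 - A11) * (B11 + B12)
M7 = (A12 - A22) * (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Seven block multiplications instead of eight, applied recursively, gives
the O(n^log2(7)), roughly O(n^2.807), bound.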


Jeffrey Creem

unread,
Oct 24, 2006, 11:50:58 PM10/24/06
to
Jeffrey R. Carter wrote:
> Jeffrey Creem wrote:
>
>>
>> 2) Small changes to the original code that still meet the original
>> structure and intent of the original code can move the run time up and
>> down by at least 50%
>> 3) The two task based version of the code is running more than 2x
>> faster than the single task version on a 2 processor machine. Some of
>> this is from the two tasks but looking at the assembly, another
>> portion of it is related to #2 above in that the re-arrangement of
>> the math allows the compiler to get less brain dead.
>
>
> These seem quite odd to me. Perhaps whatever is causing this is also the
> cause of the speedup I saw in the sequential case when the
> multiplication is put in a procedure.
>

Ok, we are all probably tired of talking about this topic, but I posted
what I think will be the final write-up on this (until the bugzilla
issue for it is resolved).

http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison

And I agree that the > 2x speedup for the two-task version is not real.
There are now other versions that are coded in a similar manner to the
two-task version but which are single-threaded. The two-task version
runs almost exactly 2x faster (sometimes slightly less) on my
dual-processor machine.

claude...@equipement.gouv.fr

unread,
Oct 25, 2006, 11:32:53 AM10/25/06
to
A way to get better performance is to optimize the use of the memory
cache.

Here is an Ada-written matmul procedure that approaches the speed of the
Fortran intrinsic matmul with -O2 and -gnatp:

Ada version with array size of 600 : time = 0.6722
Fortran version with array size of 600 : time = 0.6094


package Array_Product is

   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;

   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix);

end Array_Product;


package body Array_Product is

   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix) is
      BL : array (1 .. B'Length (1) * B'Length (2)) of Float;
      D : Natural := 0;
      Sum : Float := 0.0;
   begin
      D := 0;
      -- The transposition gave the best speedup: cache-friendly, plus
      -- only one index to compute.
      Transpose_B_in_BL :
      for J in B'Range (1) loop
         for R in B'Range (2) loop
            BL (D + R) := B (R, J);
         end loop;
         D := D + B'Length (2);
      end loop Transpose_B_in_BL;

      for I in A'Range (1) loop
         declare
            -- Copying the row of A gives the cache its best chance;
            -- this was the second speedup.
            Ai : array (A'Range (2)) of Float;
         begin
            for R in Ai'Range loop
               Ai (R) := A (I, R);
            end loop;
            D := 0;
            for J in A'Range (2) loop
               Sum := 0.0;
               for R in A'Range (2) loop
                  Sum := Sum + Ai (R) * BL (D + R);
               end loop;
               D := D + A'Length (2);
               C (I, J) := Sum;
            end loop;
         end;
      end loop;
   end Matmul;

end Array_Product;
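
A hypothetical call site, reusing the declarations above:

   declare
      A : Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
      B : Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
      C : Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
   begin
      -- fill A and B, then:
      Array_Product.Matmul (A.all, B.all, C.all);
   end;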


Jeffrey Creem wrote:
