
Why does Fortran A*B (for matrices) seem to be slow?


Nasser M. Abbasi

Dec 12, 2010, 3:50:40 AM

Platform:
mingw32, on windows 7 with gfortran 4.6 for fortran

I am multiplying 2 matrices, A with B. For testing of speed,
I am finding the CPU time for Fortran slow compared with
Mathematica or Matlab. 10 times slower.

There is a good chance I am not doing this right.

I use MATMUL and I also use dgemm(), which I am not too
familiar with; I copied the call from a post I saw, but
verified it is correct against the LAPACK documentation.

I run a loop, multiplying A with B 100 times, modifying A
slightly each time. The matrices are 1000 by 1000.

Here is the result from Fortran (gfortran 4.6) compiled with -O3

---------------------------------------
$ gfortran -O3 t2.f90 /usr/lib/liblapack.a /usr/lib/libblas.a
$ ./a.exe

Time, MATMUL: 103.10101
dgemm: 116.79800
---------------------------------------

Here is the result from Mathematica (the fastest), version 8.0

8.174

Here is the result from Matlab version 2010a, almost the
same as Mathematica

Elapsed time is 8.378947 seconds.
--------------------------------------

Below is the code I used for all the above, and the compile
command I used for Fortran.

Does anyone see a reason why I am getting such slow numbers
for Fortran? I am no expert in Fortran (nor in the others, for
that matter; I am a student learning a little Fortran, so I
might not be doing something correctly).

------- FORTRAN t2.f90 ------------------
program t2
  implicit none

  REAL time_begin, time_end
  integer, parameter :: n = 1000
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i

  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)

  call cpu_time(time_begin)

  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
  enddo

  call cpu_time(time_end)

  PRINT *, 'Time, MATMUL: ', time_end - time_begin

  call cpu_time(time_begin)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    call dgemm('n','n', n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
  enddo

  call cpu_time(time_end)
  PRINT *, 'dgemm: ', time_end - time_begin

end program t2
-------------------------------------------------
$ gfortran -O3 t2.f90 /usr/lib/liblapack.a /usr/lib/libblas.a

Matlab:
----------- matlab script -------------
n = 1000;
m = 100;
A = rand(n);
B = rand(n);
C = zeros(n);
tic
for i = 1:m
    A(1,1) = A(1,1) + 0.1;
    C = A * B;
end
toc
-----------------------------

Mathematica

------ script -------------

n = 1000;
a = RandomReal[{0,1}, {n,n}];
b = RandomReal[{0,1}, {n,n}];
m = 100;

Timing[
  For[i = 0, i < m, i++,
    {
      a[[1,1]] = a[[1,1]] + .1;
      c = a.b;
    }
  ]
]

-------------------------

--Nasser

robin

Dec 12, 2010, 4:49:09 AM
"Nasser M. Abbasi" <n...@12000.org> wrote in message news:ie22de$7ns$1...@speranza.aioe.org...

Did you compare results? Did you try it with n = 3, m = 1 ?


Colin Watters

Dec 12, 2010, 5:03:17 AM
"Nasser M. Abbasi" <n...@12000.org> wrote in message
news:ie22de$7ns$1...@speranza.aioe.org...
>

Depending on how good the optimizer is in Matlab, it might be optimizing the
loop out of existence. Most optimizing Fortran compilers will do this as
well given half a chance.

In the Matlab script you show, you do NOTHING with the result of the matrix
multiply, C. Hence the optimizer can choose not to store it, and logically
therefore, not to perform the operation that calculates the result it has
decided not to store. Same then goes for adding 0.1 to A(1,1). So the loop
is now empty: so it too gets optimized out of existence.

You would be left with setting A, B and C to the results of the functions
RAND and ZEROS. Probably, Matlab can't optimize these away, so the result
is, you are timing how long these take.

Caveat: I know nothing of how Matlab works or how good its optimization is,
so the above is nothing more than educated guesswork. But I have seen
similar results from FORTRAN (it was pre-90) benchmarks.

The Fortran loop cannot be "optimized" in this way since the optimizer
doesn't know what happens inside the MATMUL function, nor the DGEMM
subroutine. However, if you replaced these with a series of convoluted
mathematical statements and intrinsic function references, but forgot to do
anything with the answer, the same behaviour will likely occur.

--
Qolin

Email: my qname at domain dot com
Domain: qomputing


Nasser M. Abbasi

Dec 12, 2010, 5:12:16 AM
On 12/12/2010 1:49 AM, robin wrote:

>
> Did you compare results? Did you try it with n = 3, m = 1 ?
>

No, I did not try n=3, m=1; I am not interested in small toy problems.

I wanted to see the performance on something a little more
real, so I chose an n=1000 matrix and m=100 iterations.

--Nasser


Nasser M. Abbasi

Dec 12, 2010, 5:21:43 AM
On 12/12/2010 2:03 AM, Colin Watters wrote:

>
> Depending on how good the optimizer is in Matlab, it might be optimizing the
> loop out of existence. Most optimizing Fortran compilers will do this as
> well given half a chance.
>

OK, so I added some logic in the loop to hopefully prevent any funny
optimization of the loop out of existence. I did not want to add
logic that would disturb the point of this exercise, which is to
test A*B performance. Will the version below now meet your
requirement? This check should prevent the loop from being
optimized away, I would think:

--------------------------------------------


n=1000;
m=100;
A = rand(n);
B = rand(n);
C = zeros(n);
tic

for i = 1:m
    A(1,1) = A(1,1) + 0.1;
    C = A * B;

    if C(3,3) == C(5,5)
        fprintf('***********************');
    end

end
toc
Elapsed time is 6.945720 seconds.

------------------------------------------

Still more than 10 times faster. If you still think the loop is
optimized away, I'll be happy to add more code to prevent this, if
you suggest something else.

> In the Matlab script you show, you do NOTHING with the result of the matrix
> multiply, C. Hence the optimizer can choose not to store it, and logically
> therefore, not to perform the operation that calculates the result it has
> decided not to store. Same then goes for adding 0.1 to A(1,1). So the loop
> is now empty: so it too gets optimized out of existence.
>
> You would be left with setting A, B and C to the results of the functions
> RAND and ZEROS. Probably, Matlab can't optimize these away, so the result
> is, you are timing how long these take.
>
> Caveat: I know nothing of how Matlab works or how good its optimization is,
> so the above is nothing more than educated guesswork. But I have seen
> similar results from FORTRAN (it was pre-90) benchmarks.
>
> The Fortran loop cannot be "optimized" in this way since the optimizer
> doesn't know what happens inside the MATMUL function, nor the DGEMM
> subroutine. However, if you replaced these with a series of convoluted
> mathematical statements and intrinsic function references, but forgot to do
> anything with the answer, the same behaviour will likely occur.
>

--Nasser

Aris

Dec 12, 2010, 7:43:21 AM
"Nasser M. Abbasi" <n...@12000.org> wrote:
> Platform:
> mingw32, on windows 7 with gfortran 4.6 for fortran
>
> I am multiplying 2 matrices, A with B. For testing of speed,
> I am finding the CPU time for Fortran slow compared with
> Mathematica or Matlab. 10 times slower.

Your Fortran example multiplies double precision matrices. In the Matlab
and the Mathematica examples I cannot tell.

Nasser M. Abbasi

Dec 12, 2010, 8:01:37 AM

Matlab uses double by default.

--Nasser

nm...@cam.ac.uk

Dec 12, 2010, 7:23:51 AM
In article <ie2h3u$9t2$1...@speranza.aioe.org>,

Sigh. This is almost certainly the difference between blocked
and unblocked methods. You can implement either in Fortran, and
LAPACK has options to configure/select that. On a modern cached
machine, blocked methods are almost invariably faster.
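For readers who haven't seen the idea, here is a minimal sketch of a
blocked multiply in plain Fortran (the block size and loop structure
are illustrative only; real BLAS kernels are far more sophisticated):

program blocked_matmul
  implicit none
  integer, parameter :: n = 1000, bs = 64   ! bs chosen so tiles fit in cache
  real(8) :: a(n,n), b(n,n), c(n,n)
  integer :: ii, jj, kk, i, j, k
  call random_number(a)
  call random_number(b)
  c = 0
  do jj = 1, n, bs                 ! walk over bs-by-bs tiles of C
    do kk = 1, n, bs
      do ii = 1, n, bs
        do j = jj, min(jj+bs-1, n) ! work on one cache-resident tile
          do k = kk, min(kk+bs-1, n)
            do i = ii, min(ii+bs-1, n)
              c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
          end do
        end do
      end do
    end do
  end do
  print *, c(1,1)
end program blocked_matmul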

I am a bit surprised that the version of DGEMM you found was slow,
but there are certainly fast ones around.


Regards,
Nick Maclaren.

Ian Bush

Dec 12, 2010, 8:29:23 AM

Are you sure your mathematica and matlab scripts are doing what you think
they are doing? By my estimation from what you have said above you are claiming
that on single core they are getting around 24 GFlops. I don't know of a current
x86 core which can manage that ... The Fortran numbers look a lot more sensible.
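(For scale: one 1000x1000 double precision multiply costs about
2*n^3 = 2e9 flops, so 100 of them in the reported ~8.2 seconds works
out to roughly 2.4e10 flops/s, i.e. about 24 GFlops.)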

A quick check on the results might be a good idea,

Ian

James Van Buskirk

Dec 12, 2010, 10:16:02 AM
"Ian Bush" <please@dont_email_me.at.all> wrote in message
news:ie2inm$g8o$1...@news.eternal-september.org...

> Are you sure your mathematica and matlab scripts are doing what you think
> they are doing? By my estimation from what you have said above you are
> claiming
> that on single core they are getting around 24 GFlops. I don't know of a
> current
> x86 core which can manage that ... The Fortran numbers look a lot more
> sensible.

A 3 GHz Core 2 Duo could manage 24 GFlops if it used both cores given
perfect code. On Windows, you could right-click the task bar, bring up
Task Manager and look at the Performance tab to see if the machine
were using all cores as it did the calculation.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


Nasser M. Abbasi

Dec 12, 2010, 10:22:01 AM
On 12/12/2010 5:29 AM, Ian Bush wrote:

>
> Are you sure your mathematica and matlab scripts are doing what you think
> they are doing? By my estimation from what you have said above you are claiming
> that on single core they are getting around 24 GFlops. I don't know of a current
> x86 core which can manage that ... The Fortran numbers look a lot more sensible.
>
> A quick check on the results might be a good idea,
>
> Ian
>

OK, I redid this again. Now I print the Euclidean norm of the
resulting matrix of A*B, i.e. the C matrix, and it is indeed
changing, so the multiplication does occur.

In addition, I make new A, B random matrices inside the loop.

So as not to affect the time used by A*B, I tally the time only
for the A*B operation and not for any of the other operations.

The result is the same. And this proves the multiplication does
occur. I read somewhere that Matlab and Mathematica ship with
their own optimized LAPACK and BLAS libraries. So maybe the
LAPACK and BLAS libraries that come with gfortran or gcc on my
system are not built optimized? Just a thought.

------------------------------
EDU>> clear all;

n = 1000;
m = 100;
C = zeros(n);
total_cpu = 0;


A = rand(n);
B = rand(n);

total_elapsed_time=0;

for i = 1:m

A = rand(n);
B = rand(n);

tstart = tic;


C = A * B;

total_elapsed_time = total_elapsed_time + toc(tstart);
fprintf('Iteration [%d], Elapsed time=%f, norm(C)=%f\n',...
    i, total_elapsed_time, norm(C,2));

end
fprintf(' total_elapsed_time =%f\n', total_elapsed_time);


Iteration [1], Elapsed time=0.056701, norm(C)=249789.647410
Iteration [2], Elapsed time=0.109155, norm(C)=250211.844632
Iteration [3], Elapsed time=0.170553, norm(C)=250153.270266
Iteration [4], Elapsed time=0.252508, norm(C)=249756.111917
Iteration [5], Elapsed time=0.336906, norm(C)=250181.077032
....
Iteration [96], Elapsed time=7.281783, norm(C)=250026.646782
Iteration [97], Elapsed time=7.361807, norm(C)=249820.717891
Iteration [98], Elapsed time=7.430531, norm(C)=250126.197533
Iteration [99], Elapsed time=7.512295, norm(C)=249934.773539
Iteration [100], Elapsed time=7.581717, norm(C)=250113.924458
Elapsed time is 646.543635 seconds.
total_elapsed_time =7.581717
----------------------------------

Note, from help

" TOC Read the stopwatch timer.
TIC and TOC functions work together to measure elapsed time."

This is elapsed time, not CPUTIME. But I am not running anything
else on my PC. CPUTIME sometimes gives different results, depending
on the hardware:

http://www.mathworks.com/help/techdoc/matlab_prog/f8-790895.html

"It has been the general rule for CPU-intensive calculations run on
Microsoft Windows machines that the elapsed time using cputime and the
elapsed time using tic and toc are close in value, ignoring any first
time costs. There are cases however that show a significant difference
between these two methods. For example, in the case of a Pentium 4
with hyperthreading running Windows, there can be a significant
difference between the values returned by cputime versus tic and toc."

However, I thought that to be fair to Fortran, I should do the same
thing as above but using CPUTIME, and here is the result: Matlab is
still faster, but now only by about 3 times. I do not know exactly
why CPUTIME is not close to the elapsed time, as I am not running
anything else on the PC while I do this, but from reading the above
page one can see there are cases where this might happen. So here is
the new result:

--------------
clear all;

n = 1000;
m = 100;
C = zeros(n);
total_cpu = 0;


A = rand(n);
B = rand(n);

total_CPU_time=0;

for i = 1:m

A = rand(n);
B = rand(n);

tstart = cputime;


C = A * B;

total_CPU_time = total_CPU_time + (cputime-tstart);
fprintf('Iteration [%d],total_CPU_time=%f, norm(C) =%f\n',...
i, total_CPU_time,norm(C,2));

end

fprintf(' total_CPU_time =%f\n', total_CPU_time);

fprintf(' total_CPU_time =%f\n', total_CPU_time);
Iteration [1],total_CPU_time=0.187201, norm(C) =250057.322400
Iteration [2],total_CPU_time=0.436803, norm(C) =250639.271496
Iteration [3],total_CPU_time=0.686404, norm(C) =249457.742619
Iteration [4],total_CPU_time=0.998406, norm(C) =249923.147426
Iteration [5],total_CPU_time=1.372809, norm(C) =250380.116405
Iteration [6],total_CPU_time=1.684811, norm(C) =250342.472599
Iteration [7],total_CPU_time=2.059213, norm(C) =249704.501388
Iteration [8],total_CPU_time=2.371215, norm(C) =250596.515308
Iteration [9],total_CPU_time=2.683217, norm(C) =249867.342513
Iteration [10],total_CPU_time=2.995219, norm(C) =250331.227957
Iteration [11],total_CPU_time=3.307221, norm(C) =250174.321379
...
Iteration [99],total_CPU_time=29.718190, norm(C) =249981.112816
Iteration [100],total_CPU_time=30.092593, norm(C) =250159.666322
Elapsed time is 1256.348858 seconds.
total_CPU_time =30.092593
--------------------------------------

Compare to Fortran:


$ gfortran -O3 t2.f90 /usr/lib/liblapack.a /usr/lib/libblas.a
$ ./a.exe

Time, MATMUL: 103.10101
dgemm: 116.79800


--Nasser


Nasser M. Abbasi

Dec 12, 2010, 10:25:50 AM
On Dec 12, 7:16 am, "James Van Buskirk" <not_va...@comcast.net> wrote:
> "Ian Bush" <please@dont_email_me.at.all> wrote in message
>
> news:ie2inm$g8o$1...@news.eternal-september.org...
>
> > Are you sure your mathematica and matlab scripts are doing what you think
> > they are doing? By my estimation from what you have said above you are
> > claiming
> > that on single core they are getting around 24 GFlops. I don't know of a
> > current
> > x86 core which can manage that ... The Fortran numbers look a lot more
> > sensible.
>

> A 3 GHz Core 2 Duo could manage 24 GFlops if it used both cores given
> perfect code.  On Windows, you could right-click the task bar, bring up
> Task Manager and look at the Performance tab to see if the machine
> were using all cores as it did the calculation.
>

Yes, my machine is a 4-core Intel i7 930 @ 2.8 GHz, HPE model 270f,
with 8 GB RAM, running Windows 7, 64-bit.

http://ark.intel.com/Product.aspx?id=41447

--Nasser

Ian Bush

Dec 12, 2010, 10:33:29 AM
James Van Buskirk wrote:
> "Ian Bush" <please@dont_email_me.at.all> wrote in message
> news:ie2inm$g8o$1...@news.eternal-september.org...
>
>> Are you sure your mathematica and matlab scripts are doing what you think
>> they are doing? By my estimation from what you have said above you are
>> claiming
>> that on single core they are getting around 24 GFlops. I don't know of a
>> current
>> x86 core which can manage that ... The Fortran numbers look a lot more
>> sensible.
>
> A 3 GHz Core 2 Duo could manage 24 GFlops if it used both cores given
> perfect code.

I would put an "in theory" somewhere in the above sentence.

> On Windows, you could right-click the task bar, bring up
> Task Manager and look at the Performance tab to see if the machine
> were using all cores as it did the calculation.
>

Yes, I'm sure that at least some of the discrepancy is due to Mathematica and
MATLAB using a multithreaded BLAS; the OP just can't be getting the
performance he is seeing otherwise. And I suspect the rest is due to the
commercial packages using a better optimised BLAS (MKL, maybe)

To the OP - where did you get your BLAS from?

Ian

Ian Bush

Dec 12, 2010, 10:57:01 AM

(Talking to myself as usual ...)

D'oh, and of course the OP was using CPU_TIME to measure the time in Fortran, but
(as I understand it) wall time for the other two, a mistake I'm always making ...
So even if he is using a multithreaded BLAS from Fortran he's not comparing like
with like.

To the OP - if you measure the wall time for Fortran what do you get?

Ian

Nasser M. Abbasi

Dec 12, 2010, 11:00:01 AM
On Dec 12, 7:33 am, Ian Bush <please@dont_email_me.at.all> wrote:

>
> To the OP - where did you get your BLAS from?
>
> Ian

I am the OP replying.

They came with MINGW32, part of the whole installation.

http://www.mingw.org/

-- the OP


Nasser M. Abbasi

Dec 12, 2010, 11:17:21 AM
On Dec 12, 7:57 am, Ian Bush <please@dont_email_me.at.all> wrote:

>
> (Talking to myself as usual ...)
>
> D'oh, and of course the OP was using CPU_TIME to measure the time in Fortran, but
> (as I understand it) wall time for the other two, a mistake I'm always making ...

It is not a mistake, if you just read what I wrote.

I knew the difference. I showed the result for both wall time and
CPUTIME in my second post.

In theory, these should be the same on a single-user PC with nothing
else running. But these days, with multithreading and multicores all
over the place, CPUTIME is not as clear in meaning as it used to be.

> So even if he is using a multithreaded BLAS from Fortran he's not comparing like
> with like.
>
> To the OP - if you measure the wall time for Fortran what do you get?
>
> Ian

I am the OP again. Here is the Fortran code with elapsed time. I found
a couple of lines of code on the net that measure elapsed time in
Fortran and used them; here is the result:

--------------------------


program t2
  implicit none
  REAL time_begin, time_end
  integer, parameter :: n = 1000
  REAL(8) :: a(n,n), b(n,n), c(n,n)
  integer, parameter :: m = 100
  integer :: i

  real etime        ! Declare the type of etime()
  real elapsed(2)   ! For receiving user and system time
  real total        ! For receiving total time

  call RANDOM_NUMBER(a)
  call RANDOM_NUMBER(b)
  call cpu_time(time_begin)
  do i = 1,m
    a(1,1) = a(1,1) + 0.1
    c = MATMUL(a,b)
  enddo

  call cpu_time(time_end)
  total = etime(elapsed)

  PRINT *, 'Time, MATMUL: ', time_end - time_begin

  print *, 'End: total=', total, ' user=', elapsed(1), &
           ' system=', elapsed(2)

end program t2

$ gfortran -O3 -ffast-math -funroll-loops t2.f90 /usr/lib/liblapack.a /usr/lib/libblas.a
$ ./a.exe
Time, MATMUL: 93.974998
End: total= 94.053001 user= 94.037399 system= 1.55999996E-02

------------------------------------------------------------

-- the OP

Ian Bush

Dec 12, 2010, 12:07:55 PM
Nasser M. Abbasi wrote:
> On Dec 12, 7:57 am, Ian Bush <please@dont_email_me.at.all> wrote:
>
>> (Talking to myself as usual ...)
>>
>> D'oh, and of course the OP was using CPU_TIME to measure the time in Fortran, but
>> (as I understand it) wall time for the other two, a mistake I'm always making ...
>
> It is not a mistake, if you just read what I wrote.
>
> I knew the difference. I showed the result for both wall time and
> CPUTIME in my second post.

No - I was referring to my mistake - what I always forget is that typically CPU_TIME adds
up time across the threads. Which is a great way of showing no parallel speed up, and
that's the mistake I always forget ...

>
> In theory, these should be the same on a single-user PC with nothing
> else running. But these days, with multithreading and multicores all
> over the place, CPUTIME is not as clear in meaning as it used to be.

Agreed.

>
>> So even if he is using a multithreaded BLAS from Fortran he's not comparing like
>> with like.
>>
>> To the OP - if you measure the wall time for Fortran what do you get?
>>
>> Ian
>
> I am the OP again. Here is the Fortran code with elapsed time. I found
> a couple of lines of code on the net that measure elapsed time in
> Fortran and used them,

Look up system_clock or date_and_time; those are the standard Fortran ways.
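A minimal wall-clock timing sketch using SYSTEM_CLOCK (variable names
are illustrative):

program wall_time
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n,n), b(n,n), c(n,n)
  integer :: t0, t1, rate
  call random_number(a)
  call random_number(b)
  call system_clock(count_rate=rate)   ! clock ticks per second
  call system_clock(t0)
  c = matmul(a, b)
  call system_clock(t1)                ! wall time: unlike CPU_TIME, it is
                                       ! not summed across threads
  print *, 'wall time (s): ', real(t1 - t0, 8) / rate
end program wall_time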

OK - so my theory is

1) The Fortran code is not multi-threaded, while the others are
2) The libraries used by Mathematica and MATLAB are much more optimised
than the one you are linking to your Fortran program.

If you really want to look into this you could always try ATLAS, see

http://math-atlas.sourceforge.net/

With gfortran, -fexternal-blas could be useful as well,

Ian

steve

Dec 12, 2010, 12:46:15 PM
On Dec 12, 12:50 am, "Nasser M. Abbasi" <n...@12000.org> wrote:
>
> Does anyone see a reason why I am getting such slow numbers
> for Fortran?

gfortran uses a straightforward linear algebra definition
of matrix multiplication. It does not use a blocked scheme,
nor does it take advantage of multiple cores. Why? Because
gfortran runs on i386, x86_64, ppc, mips, sparc, sparc64,
arm, alpha, i64, hppa, etc. A scheme that works well on
x86_64 may not work so well on sparc.

Matlab and Mathematica work on 2 platforms: i386 and x86_64.
The army of paid programmers for these commercial products
can spend paid time on optimizing their matrix multiplication
schemes.

The unpaid, 5 or so, gfortran volunteers wrote a matmul
that works. Those same unpaid gfortran volunteers provided
an option that would allow a user to use any optimized BLAS
implementation of his choice. See the documentation.

As an exercise in how difficult it is to find an optimized
BLAS, download ATLAS from netlib and watch it tune the
library. It will take the better part of a day on your
hardware.

--
steve

Ron Shepard

Dec 12, 2010, 3:08:26 PM
In article <ie22de$7ns$1...@speranza.aioe.org>,

"Nasser M. Abbasi" <n...@12000.org> wrote:

> Platform:
> mingw32, on windows 7 with gfortran 4.6 for fortran

What is your hardware?


> I am multiplying 2 matrices, A with B. For testing of speed,
> I am finding the CPU time for Fortran slow compared with
> Mathematica or Matlab. 10 times slower.

There are several odd looking things with your timings.

[...]


> I use MATMUL and I also use dgemm(), which I am not too
> familiar with; I copied the call from a post I saw, but
> verified it is correct against the LAPACK documentation.

Yes, I think your dgemm() call is right, although it is written in
sort of an odd way.
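For reference, a more conventional spelling of the same call uses
double precision literals directly:

call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)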



> Time, MATMUL: 103.10101
> dgemm: 116.79800

This looks odd because dgemm() is almost always faster than the
fortran MATMUL routine. An exception is if you use compiler options
that allow MATMUL to use dgemm(), in which case the two times should
be about the same. In gfortran, I think the option is
-fexternal-blas, as noted by another poster.

So the first question is: where are you getting your BLAS library?
Is it using multiple cores (if your hardware has multiple cores)? Is
it using SSE instructions (if your hardware has SSE)? Does it
optimize strides, subblocking, and cache usage (e.g. like the MKL
and ATLAS BLAS libraries do)?


> $ gfortran -O3 t2.f90 /usr/lib/liblapack.a /usr/lib/libblas.a

Why are you using liblapack.a? It does not look like this should be
required. And as above, where did the libblas.a come from? Is it
compiled from the reference fortran source with no optimization? If
so, that might explain the timing problems.

>
> Matlab:
> ----------- matlab script -------------
> n=1000;
> m=100;
> A = rand(n);
> B = rand(n);
> C = zeros(n);

I don't know matlab syntax. Are these vectors or matrices?

As far as matlab timings in general, matlab uses the blas for the
low-level operations, so the matlab times should be the same as blas
times. The fact that they aren't suggests that you are using some
kind of unoptimized or otherwise hamstrung blas library, while
matlab is using an optimized library. In addition to general
optimization such as memory strides and cache usage, there are also
the issues related to using SSE instructions and multiple cores.

>
> Mathematica
>
> ------ script -------------
>
> n=1000;
> a=RandomReal[{0,1},{n,n}];
> b=RandomReal[{0,1},{n,n}];
> m=100;
>
> Timing[
> For[i=0,i<m,i++,
> {
> a[[1,1]] = a[[1,1]] + .1;
> c = a.b;
> }
> ]
> ]

It is very difficult to keep rows and columns straight when using
mathematica, even for experienced mathematica programmers. As far
as matrix element storage is concerned, I think the above is
equivalent to something like

call dgemm('t', 't', n,n,n, 1.0_wp, a, n, b, n, 0.0_wp, c, n)

Or maybe the arguments "a" and "b" need to be switched (i.e. the
identity C^T=(A.B)^T=B^T.A^T), I'd have to think about it some more.
Since you are just multiplying random elements and the result does
not matter, this does not account for a factor of 10 difference in
times, it is just something to keep in mind when comparing results
and times using the different languages.

One other comment is that the fortran expression A*B in the thread
title is not matrix multiplication in fortran. It is an
element-by-element multiplication of the two arrays. You realize
this in the fortran code and you correctly use matmul and dgemm as
appropriate. I don't know about matlab, but the times seem to be
consistent with mathematica, so it's probably alright there too.
But since you posted this thread to a fortran newsgroup, it seems
worth noting.
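A tiny program makes the distinction concrete (values arbitrary):

program star_vs_matmul
  implicit none
  real :: a(2,2), b(2,2)
  a = reshape( (/ 1., 2., 3., 4. /), (/ 2, 2 /) )
  b = a
  print *, a * b          ! elementwise: 1. 4. 9. 16.
  print *, matmul(a, b)   ! matrix product: 7. 10. 15. 22.
end program star_vs_matmul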

$.02 -Ron Shepard

dpb

Dec 12, 2010, 3:26:54 PM
Ron Shepard wrote:
> In article <ie22de$7ns$1...@speranza.aioe.org>,
> "Nasser M. Abbasi" <n...@12000.org> wrote:
...

>> Matlab:
>> ----------- matlab script -------------
>> n=1000;
>> m=100;
>> A = rand(n);
>> B = rand(n);
>> C = zeros(n);
>
> I don't know matlab syntax. Are these vectors or matrices?

In Matlab, the above each generate a square n-by-n array of DOUBLEs

> As far as matlab timings in general, matlab uses the blas for the
> low-level operations, so the matlab times should be the same as blas
> times. The fact that they aren't suggests that you are using some
> kind of unoptimized or otherwise hamstrung blas library, while
> matlab is using an optimized library. In addition to general
> optimization such as memory strides and cache usage, there are also
> the issues related to using SSE instructions and multiple cores.

Indeed, Matlab uses all of the above internally and transparently to the
user, as well as possibly using multiple cores, depending on which
version of ML the OP has (I'm not sure with which release TMW began
using threaded versions; I'm hamstrung w/ an early release).

...

I don't know nuttin' about Mathematica, but think as others noted that
there's no question the OP is comparing apples to oranges as far as Matlab
vs what he's written in Fortran w/ the particular version of libraries
and/or gfortran intrinsic...

--

Wolfgang Kilian

Dec 12, 2010, 4:35:42 PM
On 12/12/2010 09:50 AM, Nasser M. Abbasi wrote:
>
> Platform:
> mingw32, on windows 7 with gfortran 4.6 for fortran
>
> I am multiplying 2 matrices, A with B. For testing of speed,
> I am finding the CPU time for Fortran slow compared with
> Mathematica or Matlab. 10 times slower.
>
> There is a good chance I am not doing this right.
> [...]
> --Nasser

You caught gfortran on a well-known weak spot. It might be useful to
add another data point.

Repeating the test with Mathematica 7.0 and gfortran 4.5.0 on my linux
laptop (Core 2 Duo 2.8GHz, 4GB), I get:
Mathematica: 21s
gfortran -O3: 158s (MATMUL)

Now repeat this with a different Fortran compiler:
nagfor5.2 -O3: 30s

This is actually faster than Mathematica: while the Fortran executable
used 100% CPU, Mathematica used 200% (i.e., both cores). Maybe the
(threading) overhead is significant, or NAG has a better MATMUL than
Mathematica ...

The message is: while the gfortran team did a really great job in
implementing a full-fledged Fortran compiler in their spare time, there
are still particular use cases where commercial products perform better.
(See Steve's post and the documentation for -fexternal-blas.)

Note that there are many benchmarks where gfortran actually outperforms
commercial products. Just not this one.

-- Wolfgang


--
E-mail: firstnameini...@domain.de
Domain: yahoo

Ian Harvey

Dec 12, 2010, 5:25:09 PM
On 12/12/10 19:50, Nasser M. Abbasi wrote:
>
> Platform:
> mingw32, on windows 7 with gfortran 4.6 for fortran

On an Intel Core2 Duo T9400 cpu (a dual-core laptop, burdened with the
usual bevy of bloated business software...), running 32 bit Windows XP,
using Intel Visual Fortran Compiler XE for IA-32 version 12.1.127 with
command line options "/warn:all /check:none /O3 /Qxhost /Qmkl:parallel",
I compiled a slight modification of the source in your original post
(removed mixed precision arithmetic and added system_clock timings) and
get numbers like:

cpu_time for matmul: 43.4
system_clock for matmul: 43.5
cpu_time for dgemm: 22.7
system_clock for dgemm: 11.6

On the same platform with Matlab R2009B and your original script I get
a typical second or subsequent run tic-toc time of 11.9 seconds (the
first run after an edit was always about 0.5 seconds slower - which I
take to be the time for the p-code generation or whatever they do these
days).

For kicks I switched the Fortran to the F95 interface for the GEMM call
and observed no difference in timings. The CALL statement involved
quite a bit less typing though...
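For anyone curious, the F95-style call looks roughly like this - a
sketch assuming Intel MKL's BLAS95 module (link with the mkl_blas95
interface library); the defaults are no transposition, alpha=1 and
beta=0:

program gemm95
  use blas95, only: gemm   ! MKL's Fortran 95 BLAS interface (assumed available)
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  call gemm(a, b, c)       ! defaults: 'N', 'N', alpha=1, beta=0
  print *, c(1,1)
end program gemm95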

As a user of the Intel compiler, I see some very familiar DLLs in my
Matlab's "bin" directory. As a result, bar interpreter overhead or
differences in higher level optimiser cleverness(*), I wouldn't be
expecting significant differences.

Later I looked at the relative memory usage of the two approaches. A
typical working set requirement for the Fortran program was 25 MB in the
MATMUL bit, jumping to 30 MB in GEMM, while Matlab was approximately 60
MB (using a freshly started Matlab environment and emptying the working
set for the Matlab process before each run of the script).

IanH


(*) I was half expecting the Intel compiler to notice that the variable
"c" wasn't being used, and so remove the MATMUL and GEMM calls. But
obviously it didn't.

robin

Dec 12, 2010, 8:14:52 PM
"Nasser M. Abbasi" <n...@12000.org> wrote in message news:ie276d$ij4$1...@speranza.aioe.org...

| On 12/12/2010 1:49 AM, robin wrote:
|
| > Did you compare results? Did you try it with n = 3, m = 1 ?
| >
|
| No, I did not try n=3, m=1; I am not interested in small toy problems.

Without doing that, you have no idea whether your codes
are all solving the same problem.

| I wanted to see the performance on something a little more
| real, so I chose an n=1000 matrix and m=100 iterations.

As you haven't verified whether the codes are solving
the same problem, doing 1000 x 100 is meaningless.



gmail-unlp

Dec 14, 2010, 5:24:32 PM


Hi, here are some combinations/experiments with your code, maybe
related only to

> I might not be doing something correctly

A) Linking with blas library "included" in/by the gfortran compiler/
linker

$ gfortran -O3 t2.f90 -lblas
$ time a.out
Time, MATMUL: 218.35580
dgemm: 226.20560

real 7m24.694s
user 7m24.597s
sys 0m0.026s

This "means":
A.1) Most likely, MATMUL uses the same algorithm as dgemm
A.2) The process a.out is using only one core (8 available), since
"real" is about the same as "user"


B) Linking with ATLAS without threading
$ gfortran -O3 t2.f90 [ATLAS-libdir]/libf77blas.a [ATLAS-libdir]/libatlas.a
$ time a.out
Time, MATMUL: 217.12399
dgemm: 29.367538

real 4m6.595s
user 4m6.524s
sys 0m0.032s

This "means":
B.1) MATMUL and "gfortran blas" are using a performance-poor algorithm
(btw: I would not enhance gfortran algorithms, there are specific
libraries such as ATLAS which already do so).
B.2) The process a.out is using one core (8 available), since "real"
is about the same as "user"
B.3) The reduction in runtime due to using the ATLAS version of dgemm
is about
226.20560s - 29.367538s = 196.838062s
B.4) The speedup due to using the ATLAS version of dgemm instead of
the one in "gfortran blas" is about
226.2056 / 29.367538 which is about 7.7


C) Linking with ATLAS with threading
$ gfortran -O3 t2.f90 -pthread [ATLAS-libdir]/libptf77blas.a [ATLAS-libdir]/libatlas.a
$ time a.out
Time, MATMUL: 215.96117
dgemm: 36.310486

real 3m43.482s
user 4m4.286s
sys 0m8.050s

This "means":
C.1) The process a.out is using more than one core (8 available),
since "real" is less than "user"
C.2) dgemm() is using more than one core, since everything else uses
just one.
C.3) The reduction in runtime due to using the threaded ATLAS version
(instead of the non-threaded one) is about
4m6.595s - 3m43.482s = 23.113s
i.e. dgemm() is using "just"
29.367538s - 23.113s = 6.255s
which is not 1/8 of the sequential time, by the way...
C.4) cpu_time() adds up the CPU time used in every core; that's why
the dgemm() time is reported greater than in the previous runs, but
the "real elapsed time" is less than in the previous runs.
C.5) Overhead for processing in threads in dgemm() is about
36.310486s - 29.367538s = 6.942948s

I don't know about the other tools/libraries/programs, but I think
the Fortran runtime can be reduced a lot by using the appropriate
libraries.

Hope this helps,

Fernando.

JB

Dec 15, 2010, 9:16:28 AM
On 2010-12-14, gmail-unlp <ftin...@gmail.com> wrote:
> A) Linking with blas library "included" in/by the gfortran compiler/
> linker

For the record, there is no BLAS library included with the official
gfortran compiler releases. There might of course exist 3rd party
distributions that do include gfortran, a BLAS library and whatever
else.

As mentioned previously in this thread, there is the -fexternal-blas
option that makes gfortran, where possible, use a user-provided BLAS
library instead of the builtin matrix multiplication procedure for
MATMUL.
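For example, something like (library paths illustrative, following the
ATLAS example earlier in the thread):

$ gfortran -O3 -fexternal-blas t2.f90 [ATLAS-libdir]/libf77blas.a [ATLAS-libdir]/libatlas.a

would route large MATMULs through the external DGEMM as well.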

--
JB

gmail-unlp

Dec 15, 2010, 10:31:00 AM
On Dec 15, 11:16 am, JB <f...@bar.invalid> wrote:

Yes, my mistake, thanks for making it clear... again...

Fernando.
