How well do real Fortran compilers (e.g. Intel's) automatically parallelize
such code?
--
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com/products/?u
The LAPACK libraries make use of lower level routines from the BLAS
libraries to perform more basic linear algebra operations (such as
matrix-matrix multiplication). In practice, nearly all of the time
spent within a LAPACK subroutine call is actually spent in lower level
BLAS subroutines. There are many threaded implementations of the BLAS
that can be used with a single threaded LAPACK to greatly speed up
your code.
In particular, look at the ATLAS implementation of the BLAS: it's open
source software and runs on a wide variety of systems.
There are also a number of commercial packages that include threaded
BLAS and LAPACK libraries. Intel's product is the "Math Kernel
Library" (MKL), while AMD's is called "ACML". These libraries also
contain other functions; for example, the MKL includes fast Fourier
transform routines.
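To make "threaded BLAS" concrete, here is a minimal sketch of an ordinary
DGEMM call (the standard BLAS matrix-matrix multiply). The point is that the
source code does not change at all: you simply link against MKL, ACML, or a
threaded ATLAS instead of the reference BLAS, and the library spreads the
multiplication across cores. The matrix size and the environment variables
named in the comment are illustrative assumptions, not requirements.

program gemm_demo
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  ! Standard level 3 BLAS call: C = 1.0*A*B + 0.0*C.  A threaded BLAS
  ! decides internally how many threads to use (often controlled by an
  ! environment variable such as OMP_NUM_THREADS or MKL_NUM_THREADS).
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'c(1,1) = ', c(1,1)
end program gemm_demo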
If your code spends most of its time in the LAPACK/BLAS routines, and
if it is spending most of its time on the level 3 operations (O(n^3)
operations such as matrix factorizations), then you'll typically see a
very good speedup with the two to four processing cores common on
today's desktop machines.
However, if your code spends a lot of time in level 1 and level 2 BLAS
operations, then you probably will be disappointed with the speedup.
One reason for the poor performance is that these machines have
relatively low memory bandwidth compared to their CPU speeds. In
level 3 operations you can overcome this problem by bringing data into
local cache memory and reusing it several times before flushing it out
of the cache. For level 1 and level 2 BLAS operations there's no such
advantage. As a result, level 1 and level 2 BLAS operations tend to
be limited by the available memory bandwidth.
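A rough operation count shows why. For n-by-n matrices, a level 3 operation
like DGEMM performs about 2n^3 floating point operations on only about 3n^2
matrix entries, so each entry brought into cache can be reused on the order
of n times. A level 2 operation like DGEMV performs about 2n^2 operations
but must stream all n^2 matrix entries from memory (roughly two flops per
entry loaded), and a level 1 operation like DAXPY does 2n operations on
about 3n entries. Both of the latter are therefore paced by memory bandwidth
rather than by the floating point units. (These are the usual
back-of-the-envelope counts, not measurements.)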
Note that when you use these threaded libraries, the compiler doesn't
have to do anything to parallelize the code. If you depend on the
Fortran compiler to automatically parallelize the parts of the code
that you've written yourself, the compiler has to do far more work to
analyze your code, and the performance of the resulting code is often
disappointing.
An alternative that can be helpful is adding OpenMP directives to your
code. With OpenMP, you add special comments to your code that give the
compiler explicit directions on how to parallelize it. OpenMP is
available in Intel's C and Fortran compilers as well as in the open
source GCC compilers.
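As a made-up illustration of what such a directive looks like in Fortran,
the sketch below scales and sums a vector. The !$omp lines are ordinary
comments to a compiler that isn't processing OpenMP, and become explicit
parallelization instructions when the compiler's OpenMP option is enabled
(e.g. -fopenmp with gfortran). The loop and variable names are invented for
this example.

program openmp_demo
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  double precision, allocatable :: x(:), y(:)
  double precision :: alpha, total
  allocate(x(n), y(n))
  x = 1.0d0
  y = 0.0d0
  alpha = 2.0d0
  total = 0.0d0
  ! The directive tells the compiler exactly how to parallelize the loop:
  ! iterations are divided among threads, and the per-thread partial sums
  ! of "total" are combined at the end by the reduction clause.
  !$omp parallel do reduction(+:total)
  do i = 1, n
     y(i) = y(i) + alpha*x(i)
     total = total + y(i)
  end do
  !$omp end parallel do
  print *, 'total = ', total
end program openmp_demo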
Performance --> disappointing. Are there any links to timing results,
like the following?
Threads   Elapsed time   Speedup
   1           ?            ?
   2           ?            ?
   3           ?            ?
   4           ?            ?
   :           ?            ?
   :           ?            ?
See for example:
B. Borchers and J. G. Young. Implementation of a Primal-Dual Method
for SDP on a Shared Memory Parallel Architecture. Computational
Optimization and Applications, 37(3):355-369, 2007.
http://dx.doi.org/10.1007/s10589-007-9030-3
In particular, look at Table 2. On one particular problem (theta6 in
the table) in a section of the code (labeled "Elements" in the table)
that was compiled with automatic compiler parallelization, on an IBM
p690 computer (with 32 processors), we had speedups of
Threads   Speedup
   1        1
   2        0.90
   4        0.72
   8        0.72
  16        0.64
In other words, the single threaded code was faster than the code
compiled with automatic parallelization, even with 16 processors. In
fact, throwing more processors at it made things worse.
After rewriting this part of the code using OpenMP directives, the
speedups on this part of this problem improved to (see Table 4 in the
paper)
Threads   Speedup
   1        1
   2        1.6
   4        2.9
   8        5.4
  16       11.4
Another part of this code involved the Cholesky factorization of a
large positive definite matrix using LAPACK/BLAS. The Cholesky
factorization parallelized very effectively.
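For reference (a sketch, not the code from the paper): the LAPACK routine
for the Cholesky factorization of a dense positive definite matrix is
DPOTRF, and because DPOTRF spends nearly all of its time in level 3 BLAS
kernels, linking against a threaded BLAS parallelizes it without touching
the calling code. The matrix construction below is just an arbitrary way to
manufacture a positive definite test matrix.

program cholesky_demo
  implicit none
  integer, parameter :: n = 1500
  integer :: i, info
  double precision, allocatable :: a(:,:)
  allocate(a(n,n))
  ! Build a symmetric matrix and add n to the diagonal so that it is
  ! diagonally dominant, hence safely positive definite.
  call random_number(a)
  a = 0.5d0*(a + transpose(a))
  do i = 1, n
     a(i,i) = a(i,i) + dble(n)
  end do
  ! Factor A = L*L**T in place, using the lower triangle of A.
  call dpotrf('L', n, a, n, info)
  if (info /= 0) then
     print *, 'dpotrf failed, info = ', info
  else
     print *, 'factorization succeeded, L(1,1) = ', a(1,1)
  end if
end program cholesky_demo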
I've also done some testing of the parallel efficiency of DGEMV (the
BLAS matrix-vector multiply routine), which showed that on a typical
consumer grade PC with an Intel Core 2 Duo processor, the memory
bandwidth limits the performance to the point that multithreading
simply isn't worthwhile. On a higher end workstation with four cores,
a faster front side bus, and fast FB-DIMM memory, I found that using
two threads for DGEMV was faster than using one thread, but beyond
that adding more threads didn't help.
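That kind of test is easy to reproduce. Below is a rough sketch (the matrix
size and repetition count are arbitrary choices, not the settings used in
the tests above) that times repeated DGEMV calls, with the number of BLAS
threads set outside the program through the library's environment variable.

program gemv_bench
  implicit none
  integer, parameter :: n = 4000, reps = 50
  integer :: k, t0, t1, rate
  double precision, allocatable :: a(:,:), x(:), y(:)
  allocate(a(n,n), x(n), y(n))
  call random_number(a)
  call random_number(x)
  y = 0.0d0
  call system_clock(t0, rate)
  do k = 1, reps
     ! Level 2 BLAS: y = 1.0*A*x + 1.0*y.  Every matrix entry is read
     ! from memory on every call, so this mostly measures memory bandwidth.
     call dgemv('N', n, n, 1.0d0, a, n, x, 1, 1.0d0, y, 1)
  end do
  call system_clock(t1)
  print *, 'elapsed seconds: ', dble(t1 - t0)/dble(rate)
end program gemv_bench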
This is exactly the kind of advice I was looking for. Thank you very much!
Brian:
Who is coward?
It is not surprising that the auto-parallelizer and threaded BLAS show no
speedup, and may even make things worse. Equation.com also has benchmarks
showing almost perfect speedup, obtained neither with the auto-parallelizer
nor with OpenMP; those parallel solvers were developed before OpenMP was
introduced. The following timing result, for the solution of a system of
equations (compiler: Intel Fortran), was obtained on a baby dual 200-MHz
Pentium Pro.
C:\TEMP>bench1_intel
number of equations: 2000000
Half bandwidth: 8
Processor: 1
Elapsed Time (Seconds): 51.44
CPU Time in User Mode (Seconds): 50.62
CPU Time in Kernel Mode (Seconds): 0.81
Total CPU Time (Seconds): 51.44
Processors: 2
Elapsed Time (Seconds): 26.26
CPU Time in User Mode (Seconds): 51.59
CPU Time in Kernel Mode (Seconds): 0.78
Total CPU Time (Seconds): 52.38