In cases of fill buffer or register pressure, it may pay off to split
large vectorized loops. ifort has some capability for automatic
"distribution" (splitting) of loops, as well as the directive
!dir$ distribute point for cases where you wish to dictate where the
splits occur.
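For instance (a minimal sketch with hypothetical arrays; the directive
asks ifort to split the loop at that point, giving one loop that stores
a and b and a second that stores c and d):

    program distribute_demo
      integer, parameter :: n = 1000
      real :: a(n), b(n), c(n), d(n), s
      integer :: i
      s = 2.0
      a = 1.0; b = 1.0; c = 1.0; d = 1.0
      do i = 1, n
        a(i) = s*a(i)
        b(i) = s*b(i)
        ! ifort attempts to distribute the loop here
        !dir$ distribute point
        c(i) = s*c(i)
        d(i) = s*d(i)
      end do
      print *, a(1), d(n)
    end program distribute_demo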
The most relevant hardware limits are the number of independent
program-accessible registers (16 for x86_64; there are several times
that number of shadow registers for hardware renaming) and the 10 fill
buffers, so SSE code (without hyperthreading) tends to perform best
with no more than 7-9 array sections stored per loop. The fill buffer
limitation is less relevant with MIC and AVX512, where a single store
covers an entire cache line even without use of a fill buffer. Past
architectures had write-combine buffers playing a similar role to fill
buffers, but there weren't as many of them.
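By hand, such a split looks like the sketch below (hypothetical arrays;
the only point is keeping the number of stored streams per loop under
the fill buffer count):

    program split_stores
      integer, parameter :: n = 1000
      real :: x(n)
      real :: a1(n), a2(n), a3(n), a4(n), a5(n), a6(n)
      real :: a7(n), a8(n), a9(n), a10(n), a11(n), a12(n)
      integer :: i
      call random_number(x)
      ! one loop storing all 12 streams would exceed the 10 fill
      ! buffers; two loops of 6 stored streams each stay under it
      do i = 1, n
        a1(i) = 1.0*x(i); a2(i) = 2.0*x(i); a3(i) = 3.0*x(i)
        a4(i) = 4.0*x(i); a5(i) = 5.0*x(i); a6(i) = 6.0*x(i)
      end do
      do i = 1, n
        a7(i) = 7.0*x(i); a8(i) = 8.0*x(i); a9(i) = 9.0*x(i)
        a10(i) = 10.0*x(i); a11(i) = 11.0*x(i); a12(i) = 12.0*x(i)
      end do
      print *, a1(1), a12(n)
    end program split_stores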
gfortran has difficulty with vectorization of loops where several
results are stored per loop, as the optimization depends on the
compiler's ability to deal with relative alignments among the arrays.
PGI took that step long ago, at least for arrays in common blocks. Yes,
modern Fortran compilers do take advantage of common blocks.
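As a minimal illustration (hypothetical names): arrays placed in a
common block get a layout the compiler itself fixes, so their relative
alignment is known at compile time rather than discovered at run time:

    program common_alignment
      integer, parameter :: n = 1024
      real :: a(n), b(n)
      ! the compiler lays out /work/, so it knows how a and b are
      ! aligned relative to each other when vectorizing the loop
      common /work/ a, b
      integer :: i
      a = 0.0
      b = 1.0
      do i = 1, n
        a(i) = a(i) + 2.0*b(i)
      end do
      print *, a(1)
    end program common_alignment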
If you're thinking of storing results from non-vectorizable loops in
arrays so that the following work can be done in vector fashion, that
may pay off; sometimes compilers can do this automatically. The
non-vectorizable part then likely becomes the main bottleneck. Amdahl's
law tells you that 50% vectorization caps the overall speedup below 2x,
no matter how much the simd vector width increases.
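A sketch of that two-pass pattern (hypothetical recurrence; the first
loop carries a dependence and stays scalar, the second is freely
vectorizable):

    program two_pass
      integer, parameter :: n = 10000
      real :: t(n), y(n)
      integer :: i
      t(1) = 1.0
      ! pass 1: loop-carried dependence, not vectorizable
      do i = 2, n
        t(i) = 0.5*t(i-1) + real(i)
      end do
      ! pass 2: independent elementwise work, vectorizable
      do i = 1, n
        y(i) = sqrt(t(i)) + 1.0
      end do
      print *, y(n)
    end program two_pass

The Amdahl arithmetic: with vectorized fraction f and vector speedup v,
overall speedup = 1/((1-f) + f/v); at f = 0.5 the limit as v grows
without bound is 1/0.5 = 2x, so the scalar pass dominates.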
Intel Fortran also has better capability than most for automatically
fusing loops so as to reduce memory traffic. This was recognized from
the beginning as a necessity to make multiple array assignments
competitive with vectorizable DO loops written with scalar variables.
With the increasing importance of combined threading and simd
parallelism, it is again becoming more important to combine as much
work as possible in one loop. Speeding up memory has always been one of
the more expensive aspects of increasing performance.
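A simple illustration of what fusion buys (hypothetical arrays): the
three array assignments, compiled naively, make three sweeps over
memory, while the fused loop reads x once and keeps the intermediates
in registers:

    program fusion_demo
      integer, parameter :: n = 100000
      real :: x(n), a(n), b(n), c(n)
      integer :: i
      call random_number(x)
      ! as written: up to three separate sweeps over memory
      a = x + 1.0
      b = a*2.0
      c = b - x
      ! fused by hand (ifort may do this automatically): one sweep
      do i = 1, n
        a(i) = x(i) + 1.0
        b(i) = a(i)*2.0
        c(i) = b(i) - x(i)
      end do
      print *, c(n)
    end program fusion_demo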
Intel(R) Xeon Phi(TM) runs out of stack space quickly, due in part to
the need for so many threads (typically 180 per MPI rank), so
maintaining register locality (within the 32 named 512-bit simd
registers per logical processor) is quite important.