In cases of fill buffer or register pressure, it may pay off to split
large vectorized loops. ifort has some capability for automatic
"distribution" (splitting) of loops, as well as the directive
!dir$ distribute point for cases where you wish to dictate where the
splits occur.
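For instance (a minimal sketch with hypothetical arrays; the directive
asks ifort to split the loop at that point, giving one loop that stores
a and b and a second that stores c and d):

    program distribute_demo
      integer, parameter :: n = 1000
      real :: a(n), b(n), c(n), d(n), s
      integer :: i
      s = 2.0
      a = 1.0; b = 1.0; c = 1.0; d = 1.0
      do i = 1, n
        a(i) = s*a(i)
        b(i) = s*b(i)
        ! ifort attempts to distribute the loop here
        !dir$ distribute point
        c(i) = s*c(i)
        d(i) = s*d(i)
      end do
      print *, a(1), d(n)
    end program distribute_demo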
The most relevant hardware limits are the number of independent
program-accessible registers (16 for x86_64; there are several times
that number of shadow registers for hardware renaming) and the 10 fill
buffers, so SSE code (without hyperthreading) tends to perform best
with no more than 7-9 array sections stored per loop. The fill buffer
limitation is less relevant with MIC and AVX512, where a single store
covers an entire cache line even without use of a fill buffer. Past
architectures had write-combine buffers playing a similar role to fill
buffers, but there weren't as many of them.
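By hand, such a split looks like the sketch below (hypothetical arrays;
the only point is keeping the number of stored streams per loop under
the fill buffer count):

    program split_stores
      integer, parameter :: n = 1000
      real :: x(n)
      real :: a1(n), a2(n), a3(n), a4(n), a5(n), a6(n)
      real :: a7(n), a8(n), a9(n), a10(n), a11(n), a12(n)
      integer :: i
      call random_number(x)
      ! one loop storing all 12 streams would exceed the 10 fill
      ! buffers; two loops of 6 stored streams each stay under it
      do i = 1, n
        a1(i) = 1.0*x(i); a2(i) = 2.0*x(i); a3(i) = 3.0*x(i)
        a4(i) = 4.0*x(i); a5(i) = 5.0*x(i); a6(i) = 6.0*x(i)
      end do
      do i = 1, n
        a7(i) = 7.0*x(i); a8(i) = 8.0*x(i); a9(i) = 9.0*x(i)
        a10(i) = 10.0*x(i); a11(i) = 11.0*x(i); a12(i) = 12.0*x(i)
      end do
      print *, a1(1), a12(n)
    end program split_stores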
gfortran has difficulty with vectorization of loops where several
results are stored per loop, as the optimization depends on the
compiler's ability to deal with relative alignments among the arrays.
PGI took that step long ago, at least for arrays in common blocks. Yes,
modern Fortran compilers do take advantage of common blocks.
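As a minimal illustration (hypothetical names): arrays placed in a
common block get a layout the compiler itself fixes, so their relative
alignment is known at compile time rather than discovered at run time:

    program common_alignment
      integer, parameter :: n = 1024
      real :: a(n), b(n)
      ! the compiler lays out /work/, so it knows how a and b are
      ! aligned relative to each other when vectorizing the loop
      common /work/ a, b
      integer :: i
      a = 0.0
      b = 1.0
      do i = 1, n
        a(i) = a(i) + 2.0*b(i)
      end do
      print *, a(1)
    end program common_alignment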
If you're thinking of storing results from non-vectorizable loops in
arrays so that the following work can be done in vector fashion, that
may pay off; sometimes compilers can do this automatically. The
non-vectorizable part then likely becomes the main bottleneck. Amdahl's
law tells you that 50% vectorization caps the overall speedup below 2x,
no matter how much the simd vector width increases.
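A sketch of that two-pass pattern (hypothetical recurrence; the first
loop carries a dependence and stays scalar, the second is freely
vectorizable):

    program two_pass
      integer, parameter :: n = 10000
      real :: t(n), y(n)
      integer :: i
      t(1) = 1.0
      ! pass 1: loop-carried dependence, not vectorizable
      do i = 2, n
        t(i) = 0.5*t(i-1) + real(i)
      end do
      ! pass 2: independent elementwise work, vectorizable
      do i = 1, n
        y(i) = sqrt(t(i)) + 1.0
      end do
      print *, y(n)
    end program two_pass

The Amdahl arithmetic: with vectorized fraction f and vector speedup v,
overall speedup = 1/((1-f) + f/v); at f = 0.5 the limit as v grows
without bound is 1/0.5 = 2x, so the scalar pass dominates.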
Intel Fortran also has better capability than most for automatically
fusing loops so as to reduce memory traffic. This was recognized from
the beginning as a necessity to make multiple array assignments
competitive with vectorizable DO loops written with scalar variables.
With the increasing importance of combined threading and simd
parallelism, it is again becoming more important to combine as much
work as possible in one loop. Speeding up memory has always been one of
the more expensive aspects of increasing performance.
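A simple illustration of what fusion buys (hypothetical arrays): the
three array assignments, compiled naively, make three sweeps over
memory, while the fused loop reads x once and keeps the intermediates
in registers:

    program fusion_demo
      integer, parameter :: n = 100000
      real :: x(n), a(n), b(n), c(n)
      integer :: i
      call random_number(x)
      ! as written: up to three separate sweeps over memory
      a = x + 1.0
      b = a*2.0
      c = b - x
      ! fused by hand (ifort may do this automatically): one sweep
      do i = 1, n
        a(i) = x(i) + 1.0
        b(i) = a(i)*2.0
        c(i) = b(i) - x(i)
      end do
      print *, c(n)
    end program fusion_demo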
Intel(R) Xeon Phi(TM) runs out of stack space quickly, due in part to
the need for so many threads (typically 180 per MPI rank), so
maintaining register locality (within the 32 named 512-bit simd
registers per logical processor) is quite important.