Is there a way to get gfortran to use SSE?

James Van Buskirk

Dec 7, 2011, 12:43:13 PM

How do you tell gfortran that your arrays are all aligned 0 mod 16
so that it can use instructions like movaps instead of movss to
load and process data? I couldn't find any command-line switch.
There is some stuff about built-in functions in the gcc
documentation, but I don't know whether you can access them from
gfortran.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Tim Prince

Dec 7, 2011, 12:52:50 PM

On 12/7/2011 12:43 PM, James Van Buskirk wrote:
> How do you tell gfortran that your arrays are all aligned 0 mod 16
> so that it can use instructions like movaps instead of movss to
> load and process data? I couldn't find any command-line switch.
> There is some stuff about built-in functions in the gcc
> documentation, but I don't know whether you can access them from
> gfortran.
>

The question might better be asked on for...@gcc.gnu.org
The earliest gfortran versions which could use movaps required
-mtune=barcelona to do that. Current versions have similar abilities
with the corei7 options. The latest gfortran versions make more use of
movups, which is equivalent to movaps for aligned loads on recent CPU
models.
I haven't seen any option to assert alignments such as gcc has.

--
Tim Prince

James Van Buskirk

Dec 7, 2011, 1:08:13 PM

"Tim Prince" <tpr...@computer.org> wrote in message
news:9k9nfk...@mid.individual.net...

Even movups would be an improvement over what I'm currently seeing.

gfortran -c -fno-range-check -O3 -funsafe-loop-optimizations
-march=native -mtune=native

Where the processor is a 65-nm Core 2 Duo. Generated code is over
an order of magnitude slower than assembly language code even though
the compiler could pretty much transform the Fortran code to SSE
code on a one instruction per line of code basis.

timprince

Dec 7, 2011, 1:22:14 PM

I should have said that the first option which would enable gfortran to
use movups was the -mtune=barcelona.
The original core 2 duo would perform better by splitting misaligned
loads rather than using movups, so it was difficult for compilers to do
the right thing consistently, without an alignment assertion such as you
wanted. Mis-informing the compiler by setting a specific CPU type
rather than native could work to your advantage if you had sufficient
aligned loads not recognized by the compiler.

--
Tim Prince

James Van Buskirk

Dec 7, 2011, 2:01:33 PM

"timprince" <tpr...@computer.org> wrote in message
news:9k9p6n...@mid.individual.net...

> I should have said that the first option which would enable gfortran to
> use movups was the -mtune=barcelona.
> The original core 2 duo would perform better by splitting misaligned loads
> rather than using movups, so it was difficult for compilers to do the
> right thing consistently, without an alignment assertion such as you
> wanted. Mis-informing the compiler by setting a specific CPU type rather
> than native could work to your advantage if you had sufficient aligned
> loads not recognized by the compiler.

Can recent versions of ifort do this? Here is a benchmark; to run
it, all you have to do is post a disassembly of the compiler output.
It comes from a paper

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6066463

which looks kind of like an SSE version of kissfft. The Fortran
version is:

! File: hll128.f90
! Compile with:
! gfortran -c -fno-range-check hll128.f90

subroutine fft128(x,y,pTwiddle,pBinv) bind(C,name='fft128')
   use ISO_C_BINDING
   implicit none
   integer, parameter :: logn = 7
   integer, parameter :: N = 2**logn
   real(C_FLOAT) x(0:N-1), y(0:N-1)
   type(C_PTR) pTwiddle(2*(logn-2))
   integer(C_INT8_T) pBinv(0:N-1)
   integer level
   real(C_FLOAT) m0(0:3),m1(0:3),m2(0:3),m3(0:3),m4(0:3), &
      m5(0:3),m6(0:3)
   real(C_FLOAT), pointer :: pCosa(:)
   real(C_FLOAT), pointer :: pSina(:)
   integer block
   integer i, j, k
   integer(C_INT), parameter :: signer(0:3) = &
      [0_C_INT,0_C_INT,int(Z'80000000',C_INT),0_C_INT]

   do level = logn, 3, -1
      call C_F_POINTER(pTwiddle(2*(logn-level)+1),pCosa, &
         [2**(level-1)])
      call C_F_POINTER(pTwiddle(2*(logn-level)+2),pSina, &
         [2**(level-1)])
      do block = 0, 2**logn-1, 2**level
         do k = 0, 2**(level-1)-1, 4
            m0 = x(block+k:block+k+3)
            m1 = x(block+k+2**(level-1):block+k+2**(level-1)+3)
            m2 = y(block+k:block+k+3)
            m3 = y(block+k+2**(level-1):block+k+2**(level-1)+3)
            m4 = m0+m1
            m5 = m2+m3
            x(block+k:block+k+3) = m4
            y(block+k:block+k+3) = m5
            m4 = m0-m1
            m5 = m2-m3
            m0 = pCosa(k+1:k+4)
            m1 = pSina(k+1:k+4)
            m2 = m0*m4
            m3 = m1*m5
            m6 = m2+m3
            x(block+k+2**(level-1):block+k+2**(level-1)+3) = m6
            m2 = m0*m5
            m3 = m1*m4
            m6 = m2-m3
            y(block+k+2**(level-1):block+k+2**(level-1)+3) = m6
         end do
      end do
   end do
   do block = 0, 2**logn-1, 4
      m0 = x(block:block+3)
      m1 = y(block:block+3)
      m2 = [m0(0),m1(0),m0(1),m1(1)]
      m3 = [m0(2),m1(2),m0(3),m1(3)]
      m4 = m2+m3
      m5 = m2-m3
!     m5(2) = -m5(2)
      m5 = transfer(ieor(transfer(m5,signer),signer),m5)
      m0 = [m4(0),m4(2),m5(0),m5(3)]
      m1 = [m4(1),m4(3),m5(1),m5(2)]
      x(block:block+3) = m0
      y(block:block+3) = m1
   end do
   do block = 0, 2**logn-1, 4
      m0 = x(block:block+3)
      m1 = y(block:block+3)
      m2 = [m0(0),m0(2),m1(0),m1(2)]
      m3 = [m0(1),m0(3),m1(1),m1(3)]
      m4 = m2+m3
      m5 = m2-m3
      m0 = [m4(0),m5(0),m4(1),m5(1)]
      m1 = [m4(2),m5(2),m4(3),m5(3)]
      x(block:block+3) = m0
      y(block:block+3) = m1
   end do
   do i = 0, N-1
      j = pBinv(i)
      j = iand(j,int(Z'ff'))
      if(j > i) then
         x([i,j]) = x([j,i])
         y([i,j]) = y([j,i])
      end if
   end do
end subroutine fft128

And the hoped-for disassembly might look something like:

format MS64 coff
; File: fasm128a.asm
; Assembled with: fasm fasm128a.asm
; See test file fft128a.f90

section 'CODE' code readable executable align 16
align 16
public fft128
fft128:
; Save registers
push rbx
push rsi
push rdi
push rbp
push r12
mov [rsp+72], rcx ; x
mov [rsp+64], rdx ; y
; mov [rsp+56], r8 ; pTwiddle
mov [rsp+48], r9 ; pBinv
mov r10, 256
mov r12, 1
point8_outer:
mov rsi, [r8] ; cos array
mov rdi, [r8+8] ; sin array
add rsi, r10
add rdi, r10
mov rax, [rsp+72] ; x
lea rcx, [rax+2*r10]
add rax, r10
mov rdx, [rsp+64] ; y
lea rbx, [rdx+2*r10]
add rdx, r10
mov r9, r12
point8_middle:
xor ebp, ebp
sub rbp, r10
point8_inner:
movaps xmm0, [rax+rbp] ; x(i:i+3)
movaps xmm1, [rcx+rbp] ; x(j:j+3)
movaps xmm2, [rdx+rbp] ; y(i:i+3)
movaps xmm3, [rbx+rbp] ; y(j:j+3)
movaps xmm4, xmm0 ; x(i:i+3)
addps xmm4, xmm1 ; x(i:i+3)+x(j:j+3)
movaps [rax+rbp], xmm4
movaps xmm4, xmm2 ; y(i:i+3)
addps xmm4, xmm3 ; y(i:i+3)+y(j:j+3)
movaps [rdx+rbp], xmm4
subps xmm0, xmm1 ; x(i:i+3)-x(j:j+3)
subps xmm2, xmm3 ; y(i:i+3)-y(j:j+3)
movaps xmm1, [rsi+rbp] ; cos(theta)
movaps xmm3, [rdi+rbp] ; sin(theta)
movaps xmm4, xmm0 ; x(i:i+3)-x(j:j+3)
mulps xmm4, xmm1 ; cos(theta)*(x(i:i+3)-x(j:j+3))
mulps xmm0, xmm3 ; sin(theta)*(x(i:i+3)-x(j:j+3))
mulps xmm3, xmm2 ; sin(theta)*(y(i:i+3)-y(j:j+3))
mulps xmm2, xmm1 ; cos(theta)*(y(i:i+3)-y(j:j+3))
addps xmm4, xmm3 ; cos(theta)*(x(i:i+3)-x(j:j+3))
; +sin(theta)*(y(i:i+3)-y(j:j+3))
movaps [rcx+rbp], xmm4
subps xmm2, xmm0 ; cos(theta)*(y(i:i+3)-y(j:j+3))
; -sin(theta)*(x(i:i+3)-x(j:j+3))
movaps [rbx+rbp], xmm2
add rbp, 16
js point8_inner
lea rax, [rax+2*r10]
lea rcx, [rcx+2*r10]
lea rdx, [rdx+2*r10]
lea rbx, [rbx+2*r10]
sub r9, 1
ja point8_middle
add r8, 16
shr r10, 1
shl r12, 1
cmp r10, 8
ja point8_outer
mov eax, 80000000h
movd xmm5, eax
pslldq xmm5, 8
mov rax, [rsp+72] ; x
mov rdx, [rsp+64] ; y
mov rbp, -512
sub rax, rbp
sub rdx, rbp
point4:
movaps xmm0, [rax+rbp]
movaps xmm1, [rdx+rbp]
movaps xmm2, xmm0
unpcklps xmm2, xmm1
unpckhps xmm0, xmm1
movaps xmm4, xmm2
addps xmm4, xmm0
subps xmm2, xmm0
xorps xmm2, xmm5
movaps xmm0, xmm4
shufps xmm0, xmm2, 0c8h ; 11001000
shufps xmm4, xmm2, 9dh ; 10011101
movaps [rax+rbp], xmm0
movaps [rdx+rbp], xmm4
add rbp, 16
js point4
mov rbp, -512
point2:
movaps xmm0, [rax+rbp]
movaps xmm1, [rdx+rbp]
movaps xmm2, xmm0
shufps xmm2, xmm1, 88h ; 10001000
shufps xmm0, xmm1, 0ddh ; 11011101
movaps xmm4, xmm2
addps xmm4, xmm0
subps xmm2, xmm0
movaps xmm0, xmm4
unpcklps xmm4, xmm2
unpckhps xmm0, xmm2
movaps [rax+rbp], xmm4
movaps [rdx+rbp], xmm0
add rbp, 16
js point2
mov rcx, [rsp+48] ; pBinv
mov rax, [rsp+72] ; x
mov rdx, [rsp+64] ; y
xor ebp, ebp
bitrev:
movzx ebx, byte[rcx+rbp]
cmp ebp, ebx
jnb bitskip
mov esi, [rax+4*rbp]
mov edi, [rax+4*rbx]
mov [rax+4*rbx], esi
mov [rax+4*rbp], edi
mov esi, [rdx+4*rbp]
mov edi, [rdx+4*rbx]
mov [rdx+4*rbx], esi
mov [rdx+4*rbp], edi
bitskip:
inc ebp
cmp ebp, 128
jb bitrev
; Restore registers
cleanup:
pop r12
pop rbp
pop rdi
pop rsi
pop rbx
ret

align 16
public _rdtsc
_rdtsc:
rdtsc
shl rdx, 32
or rax, rdx
ret

Just a brief glance at the disassembly should suffice to determine
whether the compiler is getting the idea.
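
One detail of the Fortran listing that may not be obvious is the
transfer/ieor line, which is meant to flip the sign of m5(2) by XORing
its sign bit, the same job xorps does in the assembly. A minimal
stand-alone sketch of just that step follows. The program name, the
test values and the use of -huge(0_C_INT)-1 for the sign-bit mask (so
that -fno-range-check is not needed) are assumptions made for
illustration, not part of the original benchmark.

```fortran
program sign_flip_demo
   use ISO_C_BINDING
   implicit none
   ! Sign-bit mask written as -huge-1 instead of int(Z'80000000',C_INT);
   ! this is an illustrative substitution, not the original spelling.
   integer(C_INT), parameter :: sign_bit = -huge(0_C_INT) - 1_C_INT
   integer(C_INT), parameter :: signer(0:3) = &
      [0_C_INT, 0_C_INT, sign_bit, 0_C_INT]
   real(C_FLOAT) :: m5(0:3), expected(0:3)

   m5 = [1.0_C_FLOAT, -2.0_C_FLOAT, 3.0_C_FLOAT, 4.0_C_FLOAT]
   expected = m5
   expected(2) = -expected(2)

   ! Reinterpret the four reals as integers, XOR in the mask,
   ! and reinterpret the result as reals again.
   m5 = transfer(ieor(transfer(m5, signer), signer), m5)

   if (any(m5 /= expected)) error stop "sign flip did not match"
   print *, m5
end program sign_flip_demo
```

Only element 2 has a nonzero mask, so the other three elements pass
through unchanged.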

Steven G. Kargl

Dec 7, 2011, 2:42:58 PM

On Wed, 07 Dec 2011 12:01:33 -0700, James Van Buskirk wrote:

> "timprince" <tpr...@computer.org> wrote in message
> news:9k9p6n...@mid.individual.net...
>
>> I should have said that the first option which would enable gfortran to
>> use movups was the -mtune=barcelona.
>> The original core 2 duo would perform better by splitting misaligned loads
>> rather than using movups, so it was difficult for compilers to do the
>> right thing consistently, without an alignment assertion such as you
>> wanted. Mis-informing the compiler by setting a specific CPU type rather
>> than native could work to your advantage if you had sufficient aligned
>> loads not recognized by the compiler.
>
> Can recent versions of ifort do this? Here is a benchmark, to run
> it all you have to do is to post a disassembly of compiler output.
> It comes from a paper
>
> http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6066463
>
> which looks kind of like an SSE version of kissfft. The Fortran
> version is:
>
> ! File: hll128.f90
> ! Compile with:
> ! gfortran -c -fno-range-check hll128.f90
>

(code deleted)

> And the hoped-for disassembly might look something like:
>
> format MS64 coff
> ; File: fasm128a.asm
> ; Assembled with: fasm fasm128a.asm
> ; See test file fft128a.f90

(assembly deleted)

> Just a brief glance at the disassembly should suffice to determine
> whether the compiler is getting the idea.

gfortran understands the -S option of GCC. You can directly produce
the assembly code. For the Fortran you had above, on my opteron
system I get

troutmask:sgk[203] gfc4x -S -O3 -fno-range-check jvb.f90 && grep mova jvb.s
movaps %xmm14, %xmm15
movaps %xmm12, %xmm15
movaps %xmm10, %xmm15
movaps %xmm8, %xmm15
movaps %xmm5, %xmm15
movaps %xmm3, %xmm15
movaps %xmm2, %xmm15
movaps %xmm5, %xmm15
movaps %xmm14, %xmm1
movaps %xmm3, %xmm15
movaps %xmm12, %xmm1
movaps %xmm2, %xmm15
movaps %xmm10, %xmm1
movaps %xmm5, %xmm0
movaps %xmm0, 128(%rsp)
movaps %xmm7, %xmm8
movaps %xmm5, %xmm15
movaps %xmm1, %xmm15
movaps %xmm3, %xmm8
movaps %xmm7, %xmm11
movaps %xmm5, %xmm10
movaps %xmm3, %xmm9
movaps %xmm1, %xmm8

---
steve

FX

Dec 7, 2011, 3:03:46 PM

> How do you tell gfortran that your arrays are all aligned 0 mod 16
> so that it can use instructions like movaps instead of movss to
> load and process data?

In GNU C code, you can use attributes
(http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html) to do so:

double myarray[] __attribute__((aligned(16)))

Until gfortran supports attributes, I don't believe there is a clean way
to do so.

--
FX

James Van Buskirk

Dec 7, 2011, 11:54:01 PM

"Steven G. Kargl" <s...@REMOVEtroutmask.apl.washington.edu> wrote in message
news:jbofk2$kdm$1...@dont-email.me...

Yeah, and that shows that movaps is never being used to load data,
just to copy registers. Looking at the inner loop it's all movss,
addss, subss, mulss. A shame that although SIMD fits nicely with
Fortran, there's no way for the compiler to use it because it
can't guarantee alignment.

JB

Dec 8, 2011, 4:30:38 AM

On 2011-12-08, James Van Buskirk <not_...@comcast.net> wrote:
> Yeah, and that shows that movaps is never being used to load data,
> just to copy registers. Looking at the inner loop it's all movss,
> addss, subss, mulss. A shame that although SIMD fits nicely with
> Fortran, there's no way for the compiler to use it because it
> can't guarantee alignment.

The GCC vectorizer supports something called (in compiler jargon) loop
peeling, where essentially it first does up to N scalar iterations of
a loop, and when the alignment is suitable it switches to a vectorized
loop.

For instance, in the example you showed where the type is
REAL(C_FLOAT), the compiler knows that such variables are at least
aligned on a 4-byte boundary, whereas SSE prefers (or needs, depending
on the hardware) 16-byte alignment. Thus it needs to do 0-3 scalar loop
iterations before it hits a 16-byte boundary and can switch to the
vectorized loop.

The above is of course less efficient than if one could somehow tell
the compiler that, yes, this array is 16 byte aligned, but it
shouldn't per se make vectorization impossible.
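
A hand-written sketch of what peeling amounts to, for anyone who wants
to see the shape of it: the compiler does this internally, and nothing
below is gfortran's actual implementation. The routine name
scale_peeled, the array length and the trick of reading the address
through C_LOC and TRANSFER are assumptions chosen for illustration, and
writing the loop this way does not by itself guarantee that movaps is
emitted.

```fortran
program peel_demo
   use ISO_C_BINDING
   implicit none
   integer, parameter :: npts = 37          ! arbitrary, not a multiple of 4
   real(C_FLOAT), target :: a(npts)
   real(C_FLOAT) :: ref(npts)
   integer :: i

   a = [(real(i, C_FLOAT), i = 1, npts)]
   ref = 2.0_C_FLOAT * a
   call scale_peeled(a, npts, 2.0_C_FLOAT)
   if (any(a /= ref)) error stop "peeled loop disagrees with reference"
   print *, "peeled loop matches reference"

contains

   subroutine scale_peeled(x, n, s)
      integer, intent(in) :: n
      real(C_FLOAT), intent(inout), target :: x(n)
      real(C_FLOAT), intent(in) :: s
      integer(C_INTPTR_T) :: addr
      integer :: misalign, npeel, nvec, i

      ! Address of x(1) as an integer, then its offset from a 16-byte boundary.
      addr = transfer(C_LOC(x), addr)
      misalign = int(modulo(addr, 16_C_INTPTR_T))
      ! 0 to 3 four-byte elements until the next 16-byte boundary.
      npeel = min(modulo(16 - misalign, 16) / 4, n)

      ! Scalar prologue: the "peeled" iterations.
      do i = 1, npeel
         x(i) = s * x(i)
      end do
      ! Main body: every 4-element chunk starts on a 16-byte boundary.
      nvec = npeel + 4 * ((n - npeel) / 4)
      do i = npeel + 1, nvec, 4
         x(i:i+3) = s * x(i:i+3)
      end do
      ! Scalar epilogue for the leftover elements.
      do i = nvec + 1, n
         x(i) = s * x(i)
      end do
   end subroutine scale_peeled

end program peel_demo
```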

You can try to figure out why it doesn't vectorize with
-ftree-vectorizer-verbose=N with N=[0,7].

It's not unheard of for the vectorizer to fare poorly with "hand
optimized" inner loops; it prefers simple, obvious loops, which
increase the chance that it can detect the loop pattern.
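
As a rough illustration of "simple and obvious", here is one butterfly
block written as a plain element-by-element loop, checked against the
4-wide blocked form posted earlier in the thread. The routine names,
the block length and the test data are made up for this sketch, and
whether a given gfortran version vectorizes the plain loop is not
something it can promise.

```fortran
program plain_butterfly_demo
   use ISO_C_BINDING, only: C_FLOAT
   implicit none
   integer, parameter :: half = 8           ! hypothetical half-block length
   real(C_FLOAT) :: x(0:2*half-1), y(0:2*half-1)
   real(C_FLOAT) :: xb(0:2*half-1), yb(0:2*half-1)
   real(C_FLOAT) :: c(half), s(half)
   real(C_FLOAT), parameter :: tol = 1.0e-4_C_FLOAT
   integer :: k

   do k = 1, half
      c(k) = cos(0.1_C_FLOAT * real(k, C_FLOAT))
      s(k) = sin(0.1_C_FLOAT * real(k, C_FLOAT))
   end do
   do k = 0, 2*half - 1
      x(k) = real(k, C_FLOAT)
      y(k) = real(2*half - k, C_FLOAT)
   end do
   xb = x
   yb = y

   call butterfly_plain(x, y, c, s, half)
   call butterfly_blocked(xb, yb, c, s, half)

   if (maxval(abs(x - xb)) > tol .or. &
       maxval(abs(y - yb)) > tol) error stop "plain and blocked differ"
   print *, "plain and blocked butterflies agree"

contains

   ! One scalar butterfly per iteration, no 4-element temporaries.
   subroutine butterfly_plain(x, y, c, s, half)
      integer, intent(in) :: half
      real(C_FLOAT), intent(inout) :: x(0:2*half-1), y(0:2*half-1)
      real(C_FLOAT), intent(in) :: c(half), s(half)
      real(C_FLOAT) :: dx, dy
      integer :: k
      do k = 0, half - 1
         dx = x(k) - x(k+half)
         dy = y(k) - y(k+half)
         x(k) = x(k) + x(k+half)
         y(k) = y(k) + y(k+half)
         x(k+half) = c(k+1)*dx + s(k+1)*dy
         y(k+half) = c(k+1)*dy - s(k+1)*dx
      end do
   end subroutine butterfly_plain

   ! Same arithmetic in the 4-wide blocked style used in the thread.
   subroutine butterfly_blocked(x, y, c, s, half)
      integer, intent(in) :: half
      real(C_FLOAT), intent(inout) :: x(0:2*half-1), y(0:2*half-1)
      real(C_FLOAT), intent(in) :: c(half), s(half)
      real(C_FLOAT) :: m0(0:3), m1(0:3), m2(0:3), m3(0:3), m4(0:3), m5(0:3)
      integer :: k
      do k = 0, half - 1, 4
         m0 = x(k:k+3)
         m1 = x(k+half:k+half+3)
         m2 = y(k:k+3)
         m3 = y(k+half:k+half+3)
         x(k:k+3) = m0 + m1
         y(k:k+3) = m2 + m3
         m4 = m0 - m1
         m5 = m2 - m3
         x(k+half:k+half+3) = c(k+1:k+4)*m4 + s(k+1:k+4)*m5
         y(k+half:k+half+3) = c(k+1:k+4)*m5 - s(k+1:k+4)*m4
      end do
   end subroutine butterfly_blocked

end program plain_butterfly_demo
```

The comparison uses a small tolerance rather than exact equality in
case the two forms are compiled with different instruction selection.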

--
JB

Tim Prince

Dec 8, 2011, 8:22:05 AM

[tim@tim-knf1 net]$ gfortran -O3 -ftree-vectorizer-verbose=6 -S
-fno-range-check hll128.f90

hll128.f90:75: note: not vectorized: control flow in loop.
hll128.f90:63: note: not vectorized: unexpected loop form.
hll128.f90:49: note: not vectorized: control flow in loop.
hll128.f90:19: note: not vectorized: multiple nested loops.
hll128.f90:24: note: not vectorized: unexpected loop form.
hll128.f90:24: note: not vectorized: Bad inner loop.
hll128.f90:25: note: not vectorized: unexpected loop form.
hll128.f90:1: note: vectorized 0 loops in function.

If you set ifort -vec-report3, you should learn that the compiler
considers vectorization of the pointer targets "inefficient" (on account
of their alignment being unknown).
It's reasonable to expect a compiler to set up those local 16-byte
arrays with alignment, but it takes more than that to accomplish James'
goal.
If you were a fan of Intel's directives, you could file a feature
request to have them considered for application to pointer target
alignments, or hope for an explanation why it's considered inadvisable
in this context. I'm not prepared to guess whether that would enable
ifort to optimize this "blocked" source code, or whether removing the
blocking would produce better results.