
Fortran interfaces to Intel Intrinsics (AVX,AVX2,AVX3)


bernard gingold

unread,
Dec 13, 2018, 4:11:45 AM12/13/18
to

I decided to write a set of Fortran interfaces to a specific part of the Intel AVX-AVX3 intrinsics.
The vector intrinsics are wrapped in a thin C layer which calls the appropriate SIMD intrinsic.
For your convenience I post a simple code snippet.
BEGIN example:

Compiler: icc 15

C-side

__m256d v256_add_pd(__m256d v1, __m256d v2) {
    return (_mm256_add_pd(v1, v2));
}
--------------------------------------------------------------
Compiler: ifort 15

Fortran-side

type, bind(c) :: m256d
   real(c_double), dimension(0:3) :: v
end type m256d

interface
   type(m256d) function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
      use, intrinsic :: iso_c_binding
      import :: m256d
      type(m256d), intent(in) :: v1
      type(m256d), intent(in) :: v2
   end function v256_add_pd
end interface

END example

The code compiles and links fine, but at run time the interoperability between the Fortran code and the C code fails, producing a wrong result.
During debugging, the C-side assembly code (debug build) executes the vaddpd machine instruction, which returns its result in the YMM0 register.
The Fortran-side compiler-generated assembly code did not issue a load instruction
to copy the content of the YMM0 register into the automatic variable of type m256d, which should be compatible with the C declaration of the __m256d struct.
The C struct '__m256d' contains an interoperable data member, a static array double v[4], and the struct is aligned on a 32-byte boundary (defined in immintrin.h provided by the Intel compiler).
I suppose the ifort compiler failed to recognize the return type of my C wrapper around the Intel _mm256_add_pd intrinsic as being fully compatible with the derived type 'm256d'.
The problem probably stems from disaggregation of the __m256d struct, i.e. its content being placed directly in a YMM register.

Louis Krupp

unread,
Dec 13, 2018, 5:00:52 AM12/13/18
to
Stupid question: Does the Fortran function need to declare its
arguments as passed by value?

For the benefit of those of us who aren't 100% sure what the code is
supposed to do (add two 256-bit values?), a small main program with an
indication of the expected result would be helpful.

Louis

bernard gingold

unread,
Dec 13, 2018, 5:20:30 AM12/13/18
to
Hi Louis,
Thanks for the reply.
I added the keyword 'value'; unfortunately it did not help.
The C-side code performs a vector floating-point addition of two __m256d values.
Please have a look here: https://software.intel.com/en-us/node/524042

BEGIN example

PROGRAM test_add
! The m256d type and the interface to v256_add_pd (shown above) are assumed to be in scope.
!DIR$ ATTRIBUTES ALIGN : 64 :: vres,v1,v2 ! Alignment on 32-byte boundary
type(m256d) :: vres,v1,v2
! Exec code .....

vres%v = 0.0_8
v1%v = 3.14_8
v2%v = 2.71_8

vres = v256_add_pd(v1,v2)
print*, "v256_add_pd: -- result:", vres%v

! Upon the end of execution vres%v shall contain the result of the vector
! addition: {3.14+2.71, 3.14+2.71, 3.14+2.71, 3.14+2.71}
! Unfortunately 'vres%v' contains junk values (an uninitialized-stack
! memory pattern)

END PROGRAM test_add

bernard gingold

unread,
Dec 13, 2018, 5:21:33 AM12/13/18
to
> !DIR$ ATTRIBUTES ALIGN : 32 :: vres,v1,v2 ! Alignment on 32-byte boundary

bernard gingold

unread,
Dec 13, 2018, 6:57:48 AM12/13/18
to
Disassembly of C function:

push rbp #6.17
mov rbp, rsp #6.17
and rsp, -32 #6.17
vmovapd ymm1, YMMWORD PTR .L_2il0floatpacket.1[rip] #10.19
vaddpd ymm2, ymm1, YMMWORD PTR .L_2il0floatpacket.0[rip] #11.12
vxorpd ymm0, ymm0, ymm0 #8.27
vmovapd YMMWORD PTR [-32+rsp], ymm0 #8.25
Result --> vmovapd YMMWORD PTR [-32+rsp], ymm2
mov rsp, rbp #13.1
pop rbp #13.1
ret

Milan Curcic

unread,
Dec 13, 2018, 9:02:14 AM12/13/18
to
Won't the compiler use these intrinsics on its own where appropriate, if compiled with -AVX or -AVX2 (I may be misremembering exact flag name)? Why the need for Fortran bindings?

bernard gingold

unread,
Dec 13, 2018, 9:13:06 AM12/13/18
to
On Thursday, December 13, 2018 at 3:02:14 PM UTC+1, Milan Curcic wrote:
> Won't the compiler use these intrinsics on its own where appropriate, if compiled with -AVX or -AVX2 (I may be misremembering exact flag name)? Why the need for Fortran bindings?

I'm not sure whether the Intel Fortran compiler, during compilation, translates Fortran array syntax into the C intrinsics and only later converts that to machine code.
I suppose some intermediate representation of the array syntax is used
during the optimization phase, and that representation is later transformed into machine code.

As for the Fortran bindings, I would like to be able to vectorise code manually with the help of SIMD intrinsics, just as is done in C or C++ code.

Steve Lionel

unread,
Dec 13, 2018, 11:12:58 AM12/13/18
to
On 12/13/2018 9:13 AM, bernard gingold wrote:
> I'm not sure if Intel Fortran compiler during the compilation phase will translate Fortran array syntax to C intrinsics and later convert it to machine code.
> I suppose that some intermediate representation of array syntax will be in use
> during the optimization phase and later this representation will be transformed to machine code.

The Intel compilers themselves don't use the "C intrinsics". Rather,
they recognize operations that can be (they think) made faster by use of
the AVX instructions (if you have told the compiler it's ok to use
them.) If you look at the assembly for array-heavy Fortran code you'll
often see the AVX instructions there.

The instruction intrinsics are in C for C programmers who want to use
them specifically. I expect they are used a lot in the math library and
MKL. I would not recommend their use in Fortran programs.
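
For instance, a minimal sketch (names and flags are illustrative, not from the thread) of the kind of plain array code this refers to, which ifort will typically auto-vectorize on its own when AVX use is allowed (e.g. -O3 -xCORE-AVX2):

subroutine axpy_d(n, a, x, y)
   ! Plain Fortran array syntax; no intrinsics anywhere.
   integer, intent(in)    :: n
   real(8), intent(in)    :: a, x(n)
   real(8), intent(inout) :: y(n)
   y = y + a*x   ! the compiler is free to emit vmulpd/vaddpd (or a fused multiply-add) here
end subroutine axpy_d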

--
Steve Lionel
Retired Intel Fortran developer/support
Email: firstname at firstnamelastname dot com
Twitter: @DoctorFortran
LinkedIn: https://www.linkedin.com/in/stevelionel
Blog: http://intel.com/software/DrFortran

bernard gingold

unread,
Dec 13, 2018, 11:34:04 AM12/13/18
to
Steve thanks for the answer.

Finally I was able to call the intrinsics-based code successfully. The declaration of the C wrapper was changed to void function(double * __restrict, double * __restrict, double * __restrict), and the Fortran interface was changed to a subroutine declaration accordingly.
Now everything works correctly, and only one thing bothers me -- two load instructions and one store instruction is the cost of a single vector addition. Even before working on this example I expected to be able to keep the result in a YMM register, but no matter which combination I tried, it did not succeed.

Steve Lionel

unread,
Dec 13, 2018, 4:04:58 PM12/13/18
to
On 12/13/2018 11:34 AM, bernard gingold wrote:
> Finally I was able to call intrinsics-based code successfully. Declaration of C wrapper had been changed to void function(double * __restrict, double * __restrict, double * __restrict) when in turn Fortran interface was changed to subroutine declaration.
> Now everything works correctly and only one thing bothers me -- two load instructions and one store instruction that is the cost for single vector addition. I have had an expectation even before working on that example, that I would be able to keep a result in YMM register and no matter which combination I tried it did not succeeded.

I'm going to suggest that you are wasting your time with this. The
compiler does a pretty good job of estimating performance gains, and can
also keep track of a lot more if you're not using intrinsics. On the
topic of AVX in particular, there is a penalty for starting to use AVX
that can be heavy if you have sparse uses of AVX, as the processor will
go "in and out" of AVX mode. The Intel compiler understands this and
won't use AVX unless it sees a clear advantage.

I am, in general, not in favor of "micro-optimization", especially if
you haven't done the profiling and analysis to see if the code you're
attacking has a meaningful impact. Just because AVX instructions exist
doesn't mean you will always benefit from their use.

bernard gingold

unread,
Dec 14, 2018, 3:00:47 AM12/14/18
to
I work on a radar simulation project with heavy use of the complex representation of radar signals.
I investigated how ifort handles complex arithmetic, and it does not look promising, especially when complex division is taken into account.
ifort eagerly generates scalar x87 code no matter which coding style is used, i.e. array syntax or do-loops.
I came to the conclusion that decomposing an array of complex numbers into two separate streams, i.e. a real-part array and an imaginary-part array, will probably enable at least scalar SIMD code (a sketch of that layout follows the assembly listing below).
It would be interesting to know the cost of firing up the x87 unit and of the bypass between the SIMD and x87 units.
An example of the generated x87 code:
movups xmm0, XMMWORD PTR [-16+test_$B.0.1+rcx] #21.9
movups xmm1, XMMWORD PTR [-16+test_$A.0.1+rcx] #21.9
movsd QWORD PTR [-8+rsp], xmm0 #21.9
inc sil #22.6
fld QWORD PTR [-8+rsp] #21.9
unpckhpd xmm0, xmm0 #21.9
fld st(0) #21.9
movsd QWORD PTR [-8+rsp], xmm0 #21.9
fmul st, st(1) #21.9
fld QWORD PTR [-8+rsp] #21.9
fld st(0) #21.9
fmul st, st(1) #21.9
movsd QWORD PTR [-8+rsp], xmm1 #21.9
faddp st(2), st #21.9
fxch st(1) #21.9
fdivr st, st(3) #21.9
fld QWORD PTR [-8+rsp] #21.9
unpckhpd xmm1, xmm1 #21.9
fld st(0) #21.9
movsd QWORD PTR [-8+rsp], xmm1 #21.9
fmul st, st(4) #21.9
fxch st(1) #21.9
fmul st, st(3) #21.9
fld QWORD PTR [-8+rsp] #21.9
fld st(0) #21.9
fmulp st(5), st #21.9
fxch st(4) #21.9
faddp st(2), st #21.9
fxch st(1) #21.9
fmul st, st(2) #21.9
fstp QWORD PTR [-8+rsp] #21.9
fxch st(3) #21.9
fmulp st(2), st #21.9
movsd xmm3, QWORD PTR [-8+rsp] #21.9
fxch st(2) #21.9
fsubp st(1), st #21.9
fmulp st(1), st #21.9
fstp QWORD PTR [-8+rsp] #21.9
movsd xmm2, QWORD PTR [-8+rsp] #21.9
unpcklpd xmm3, xmm2 #21.9
movups XMMWORD PTR [-16+test_$C.0.1+rcx], xmm3 #21.9
add rcx, 16 #22.6
cmp sil, 64 #22.6
jle ..B1.6 # Prob 98% #22.6
fstp st(0) #
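
A minimal sketch of the split-layout (structure-of-arrays) idea mentioned above, using the naive division formula with no Smith-style scaling; names are illustrative only:

subroutine cdiv_split(n, ar, ai, br, bi, cr, ci)
   ! Real and imaginary parts kept in separate arrays, so each statement below
   ! is a plain elementwise operation the compiler can vectorize directly.
   integer, intent(in)  :: n
   real(8), intent(in)  :: ar(n), ai(n), br(n), bi(n)
   real(8), intent(out) :: cr(n), ci(n)
   real(8) :: den(n)
   den = br*br + bi*bi
   cr  = (ar*br + ai*bi) / den
   ci  = (ai*br - ar*bi) / den
end subroutine cdiv_split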

bernard gingold

unread,
Dec 14, 2018, 3:03:14 AM12/14/18
to
I have forgotten to add source code:
Compiler: IFORT 19.0.0, optimization: -O3
subroutine test()
  !DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
  complex(kind=8), dimension(64), volatile :: c
  complex(kind=8), dimension(64) :: a
  complex(kind=8), dimension(64) :: b
  integer(kind=4) :: i
  integer(kind=4), parameter :: length = 64_4
  ! Exec code ...
  c = 0.0_8
  do i = 1_4, length
     ! a(i) = CMPLX(real(i,kind=8),real(i,kind=8))
     a(i) = CMPLX(2.71_8,2.71_8)
     b(i) = CMPLX(3.14_8,3.14_8)
  end do
  !c = a + b
  ! c = a / b
  do i = 1_4, length
     c(i) = a(i) / b(i)
  end do

end subroutine

Eugene Epshteyn

unread,
Dec 14, 2018, 10:00:26 AM12/14/18
to
> complex(kind=8), dimension(64), volatile :: c

Why is "c" declared as volatile?

Eugene Epshteyn

unread,
Dec 14, 2018, 10:04:33 AM12/14/18
to

bernard gingold

unread,
Dec 14, 2018, 10:23:29 AM12/14/18
to
On Friday, December 14, 2018 at 4:00:26 PM UTC+1, Eugene Epshteyn wrote:
> > complex(kind=8), dimension(64), volatile :: c
>
> Why is "c" declared as volatile?

"c" was declared as volatile in order to force compiler to generate code of computational do-loop.

bernard gingold

unread,
Dec 14, 2018, 10:31:05 AM12/14/18
to
Follow-up message:
Now the array "c" is not declared volatile and ifort still generates x87 code.

subroutine test()
  !DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
  complex(kind=8), dimension(64) :: c
  complex(kind=8), dimension(64) :: a
  complex(kind=8), dimension(64) :: b
  integer(kind=4) :: i
  integer(kind=4), parameter :: length = 64_4
  ! Exec code ...
  c = 0.0_8
  do i = 1_4, length
     a(i) = CMPLX(real(i,kind=8),real(i,kind=8)) !<--- If this line of code is uncommented,
                                                 ! ifort, as expected, cannot perform the
                                                 ! computation at compile time and generates x87 code
     !a(i) = CMPLX(2.71_8,2.71_8)
     b(i) = CMPLX(3.14_8,3.14_8)
  end do
  !c = a + b
  c = a / b
  ! do i = 1_4, length
  !    c(i) = a(i) / b(i)
  ! end do
  print*, c
end subroutine

bernard gingold

unread,
Dec 14, 2018, 10:36:38 AM12/14/18
to
On Friday, December 14, 2018 at 4:31:05 PM UTC+1, bernard gingold wrote:
> On Friday, December 14, 2018 at 4:23:29 PM UTC+1, bernard gingold wrote:
> > On Friday, December 14, 2018 at 4:00:26 PM UTC+1, Eugene Epshteyn wrote:
> > > > complex(kind=8), dimension(64), volatile :: c
> > >
> > > Why is "c" declared as volatile?
> >
> > "c" was declared as volatile in order to force compiler to generate code of computational do-loop.
>
> Follow up message:
> Now array "c" is not declared as a volatile and ifort still generates x87 code.
>
> subroutine test()
> !DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
> complex(kind=8), dimension(64) :: c
> complex(kind=8), dimension(64) :: a
> complex(kind=8), dimension(64) :: b
> integer(kind=4) :: i
> integer(kind=4), parameter :: length = 64_4
> ! Exec code ...
> c = 0.0_8
> do i = 1_4, length
> a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If this line of code
> ! is uncommented ifort as expected
> ! cannot perform computation at
> !compile time and generates x87 !code
> !a(i) = CMPLX(2.71_8,2.71_8)
> b(i) = CMPLX(3.14_8,3.14_8)
> end do
> !c = a + b
> c = a / b
> ! do i = 1_4, length
> ! c(i) = a(i) / b(i)
> ! end do
> print*, c
> end subroutine


And the ifort -O3 assembly output:

movsd QWORD PTR [rsp], xmm4 #17.6
lea rdx, QWORD PTR [1+rax] #14.17
fld QWORD PTR [rsp] #17.6
lea rcx, QWORD PTR [2+rax] #14.17
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
lea rsi, QWORD PTR [3+rax] #14.17
cvtdq2ps xmm10, xmm7 #12.17
fld QWORD PTR [rsp] #17.6
movaps xmm9, xmm10 #12.17
fld st(0) #17.6
lea rdi, QWORD PTR [4+rax] #14.17
fmul st, st(1) #17.6
mov r8, rdi #14.10
unpcklps xmm9, xmm10 #12.17
paddd xmm7, xmm5 #12.10
cvtps2pd xmm11, xmm9 #12.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3) #17.6
shl rdx, 4 #14.10
movhlps xmm9, xmm9 #12.10
cvtps2pd xmm13, xmm9 #12.10
movups XMMWORD PTR [-16+test_$A.0.1+rdx], xmm11 #12.10
movsd QWORD PTR [rsp], xmm11 #17.6
fld QWORD PTR [rsp] #17.6
unpckhpd xmm11, xmm11 #17.6
fld st(0) #17.6
movsd QWORD PTR [rsp], xmm11 #17.6
fmul st, st(4) #17.6
fxch st(1) #17.6
fmul st, st(3) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmulp st(5), st #17.6
shl rcx, 4 #14.10
fxch st(4) #17.6
faddp st(2), st #17.6
fxch st(1) #17.6
fmul st, st(2) #17.6
fstp QWORD PTR [rsp] #17.6
fxch st(3) #17.6
fmulp st(2), st #17.6
movsd xmm1, QWORD PTR [rsp] #17.6
fxch st(2) #17.6
fsubp st(1), st #17.6
fmulp st(1), st #17.6
fstp QWORD PTR [rsp] #17.6
movsd xmm12, QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm4 #17.6
fld QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
movups XMMWORD PTR [-16+test_$A.0.1+rcx], xmm13 #12.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3) #17.6
movsd QWORD PTR [rsp], xmm13 #17.6
fld QWORD PTR [rsp] #17.6
unpckhpd xmm13, xmm13 #17.6
fld st(0) #17.6
movsd QWORD PTR [rsp], xmm13 #17.6
fmul st, st(4) #17.6
fxch st(1) #17.6
fmul st, st(3) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmulp st(5), st #17.6
unpckhps xmm10, xmm10 #12.17
fxch st(4) #17.6
faddp st(2), st #17.6
cvtps2pd xmm3, xmm10 #12.10
fxch st(1) #17.6
fmul st, st(2) #17.6
fstp QWORD PTR [rsp] #17.6
fxch st(3) #17.6
fmulp st(2), st #17.6
movsd xmm0, QWORD PTR [rsp] #17.6
fxch st(2) #17.6
fsubp st(1), st #17.6
fmulp st(1), st #17.6
fstp QWORD PTR [rsp] #17.6
movsd xmm14, QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm4 #17.6
fld QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
shl rsi, 4 #14.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3)

bernard gingold

unread,
Dec 14, 2018, 10:38:51 AM12/14/18
to
gfortran 5.5 output:

lea rdx, [rbx+8]
xor eax, eax
.L6:
movsd xmm1, QWORD PTR [rsp+1552+rax]
movsd xmm2, QWORD PTR [rcx+rax]
movapd xmm4, xmm1
movsd xmm6, QWORD PTR [rsp+528+rax]
movapd xmm5, xmm2
andpd xmm4, xmm0
andpd xmm5, xmm0
movsd xmm3, QWORD PTR [rsi+rax]
ucomisd xmm5, xmm4
jbe .L10
movapd xmm4, xmm1
divsd xmm4, xmm2
mulsd xmm1, xmm4
addsd xmm1, xmm2
movapd xmm2, xmm6
mulsd xmm2, xmm4
addsd xmm2, xmm3
mulsd xmm3, xmm4
subsd xmm3, xmm6
divsd xmm2, xmm1
divsd xmm3, xmm1
movapd xmm1, xmm3
.L5:
movsd QWORD PTR [rbx+rax], xmm2
movsd QWORD PTR [rdx+rax], xmm1
add rax, 16
cmp rax, 1024
jne .L6

Eugene Epshteyn

unread,
Dec 14, 2018, 12:25:54 PM12/14/18
to
Have you experimented with various -x options?

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-x-qx
bernard gingold

unread,
Dec 14, 2018, 12:41:17 PM12/14/18
to
On Friday, December 14, 2018 at 6:25:54 PM UTC+1, Eugene Epshteyn wrote:
> Have you experimented with various -x options?
>
> https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-x-qx

I was using https://godbolt.org/ to run my experiments.
I will try the -x / /Qx options now.
Option: xAVX -> No Impact
Option: xCORE-AVX2 -> No Impact
Option: xHASWELL -> No Impact
Option: xSKYLAKE -> No Impact

Steve Lionel

unread,
Dec 14, 2018, 1:43:19 PM12/14/18
to
As I mentioned earlier, the compiler doesn't use AVX instructions if it
thinks that the overhead of starting AVX isn't worth it. This subroutine is
very short and it involves complex division.

The optimization report (-qopt-report=3) has some additional information:

Begin optimization report for: TEST

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (TEST) [2] D:\Projects\t.f90(1,12)


Report from: Loop nest, Vector & Auto-parallelization optimizations
[loop, vec, par]


LOOP BEGIN at D:\Projects\t.f90(10,6)
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 12
remark #15477: vector cost: 3.000
remark #15478: estimated potential speedup: 4.000
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=4
LOOP END


Non-optimizable loops:


LOOP BEGIN at D:\Projects\t.f90(9,6)
remark #15529: loop was not vectorized: volatile assignment was not
vectorized. Try using non-volatile assignment. [ D:\Projects\t.f90(9,6) ]
LOOP END

LOOP BEGIN at D:\Projects\t.f90(19,6)
remark #15529: loop was not vectorized: volatile assignment was not
vectorized. Try using non-volatile assignment. [ D:\Projects\t.f90(18,9) ]
LOOP END

Report from: Code generation optimizations [cg]

D:\Projects\t.f90(18,9):remark #34046: complex divide implemented using
x87 instructions to maintain precision.
D:\Projects\t.f90(18,9):remark #34048: consider using
complex-limited-range option to boost run time performance.
D:\Projects\t.f90(1,12):remark #34051: REGISTER ALLOCATION : [TEST]
D:\Projects\t.f90:1

bernard gingold

unread,
Dec 14, 2018, 2:24:54 PM12/14/18
to
Thanks for your response.
The volatile attribute was later removed; see the post at 4:36 PM.
Maintaining precision at the cost of 180 x87 and AVX instructions!! ... sic.
When the precision of the complex data type is set to single precision, the x87 instructions are eliminated.

Steve, you pasted the compiler output;
I presume that report is related to the simple data-initialization loop
or to the array syntax.

Thomas Koenig

unread,
Dec 15, 2018, 9:44:05 AM12/15/18
to
bernard gingold <ben...@gmail.com> schrieb:
FWIW, here is what a recent gfortran makes of this code,
with -Ofast -fverbose-asm -S -march=native on a Ryzen 7:

.L2:
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vxorps %xmm0, %xmm0, %xmm0 # tmp105
# a.f90:15: b(i) = CMPLX(3.14_8,3.14_8)
vmovsd %xmm2, 1632(%rsp,%rax) # tmp128, MEM[symbol: b, index: ivtmp.58_34, offset: 0B]
vmovsd %xmm1, 1640(%rsp,%rax) # tmp127, MEM[symbol: b, index: ivtmp.58_34, offset: 0B]
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vcvtsi2ss %edx, %xmm0, %xmm0 # i, tmp105, tmp105
vcvtss2sd %xmm0, %xmm0, %xmm0 # tmp105, _2
# a.f90:10: do i = 1_4, length
incl %edx # i
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vmovddup %xmm0, %xmm0 # _2, tmp107
vmovaps %xmm0, 608(%rsp,%rax) # tmp107, MEM[symbol: a, index: ivtmp.58_34, offset: 0B]
addq $16, %rax #, ivtmp.58
# a.f90:10: do i = 1_4, length
cmpl $65, %edx #, i
jne .L2 #,
# a.f90:15: b(i) = CMPLX(3.14_8,3.14_8)
xorl %eax, %eax # ivtmp.41
.p2align 4
.p2align 3
.L3:
vmovaps 608(%rsp,%rax), %xmm0 # MEM[symbol: a, index: ivtmp.41_33, offset: 0B], vect__46.12
vmovaps 1632(%rsp,%rax), %xmm5 # MEM[symbol: b, index: ivtmp.41_33, offset: 0B], vect__9.8
vpermilpd $1, %xmm0, %xmm4 #, vect__46.12, vect__46.17
vpermilpd $3, %xmm5, %xmm1 #,, vect__9.21
vpermilpd $0, %xmm5, %xmm2 #,, vect__9.9
vmovaps %xmm5, (%rsp) # vect__9.8, %sfp
# a.f90:18: c = a / b
vmulpd %xmm1, %xmm1, %xmm3 # vect__9.21, vect__9.21, vect_powmult_47.25
vmulpd %xmm4, %xmm1, %xmm1 # vect__46.17, vect__9.21, vect__45.22
vmovaps %xmm0, %xmm4 # vect__46.12, vect__44.23
vfmadd132pd %xmm2, %xmm1, %xmm4 # vect__9.9, vect__45.22, vect__44.23
vfmsub132pd %xmm2, %xmm1, %xmm0 # vect__9.9, vect__45.22, vect__44.24
vfmadd231pd %xmm2, %xmm2, %xmm3 # vect__9.9, vect__9.9, vect__12.27
vmovsd %xmm4, %xmm0, %xmm0 # vect__44.23, vect__44.24, tmp115
vdivpd %xmm3, %xmm0, %xmm0 # vect__12.27, tmp115, vect__54.28
vmovaps %xmm0, (%rbx,%rax) # vect__54.28, MEM[symbol: c, index: ivtmp.41_33, offset: 0B]
addq $16, %rax #, ivtmp.41
cmpq $1024, %rax #, ivtmp.41
jne .L3 #,
# a.f90:22: print*, c

ga...@u.washington.edu

unread,
Dec 15, 2018, 7:20:41 PM12/15/18
to
On Thursday, December 13, 2018 at 1:04:58 PM UTC-8, Steve Lionel wrote:

(snip)

> I'm going to suggest that you are wasting your time with this. The
> compiler does a pretty good job of estimating performance gains, and can
> also keep track of a lot more if you're not using intrinsics. On the
> topic of AVX in particular, there is a penalty for starting to use AVX
> that can be heavy if you have sparse uses of AVX, as the processor will
> go "in and out" of AVX mode. The Intel compiler understands this and
> won't use AVX unless it sees a clear advantage.

I suspect you are right, and I also don't know the OP's reason
for doing it, but sometimes you do things just to see that they
can be done. That is, just for the fun of it.

Doing things for fun usually isn't a waste of time, because
the whole purpose is fun.

You might also say that doing crossword puzzles or sudoku
is a waste of time, as nothing useful results.

> I am, in general, not in favor of "micro-optimization", especially if
> you haven't done the profiling and analysis to see if the code you're
> attacking has a meaningful impact. Just because AVX instructions exist
> that doesn't mean you will always benefit from their use.

Yes. Well, I haven't followed it closely enough, but
it would be interesting to look at TensorFlow with regard to
AVX instructions. The idea, as well as I know it, is to
describe the array operation (at any rank) and let it optimize
the code for doing that operation.

In the OP's case, there will be a lot of subroutine call overhead,
and as noted load/store overhead. One might arrange routines
to do more combined operations (multiply add), or even
operations like dot product or matrix multiply, optimized
using specific instructions. (Or, if someone already did that,
look into improving what is already done.)



robin....@gmail.com

unread,
Dec 15, 2018, 7:47:16 PM12/15/18
to
On Saturday, December 15, 2018 at 2:31:05 AM UTC+11, bernard gingold wrote:
> On Friday, December 14, 2018 at 4:23:29 PM UTC+1, bernard gingold wrote:
> > On Friday, December 14, 2018 at 4:00:26 PM UTC+1, Eugene Epshteyn wrote:
> > > > complex(kind=8), dimension(64), volatile :: c
> > >
> > > Why is "c" declared as volatile?
> >
> > "c" was declared as volatile in order to force compiler to generate code of computational do-loop.
>
> Follow up message:
> Now array "c" is not declared as a volatile and ifort still generates x87 code.
>
> subroutine test()
> !DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
> complex(kind=8), dimension(64) :: c
> complex(kind=8), dimension(64) :: a
> complex(kind=8), dimension(64) :: b
> integer(kind=4) :: i
> integer(kind=4), parameter :: length = 64_4
> ! Exec code ...
> c = 0.0_8
> do i = 1_4, length
> a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If this line of code
> ! is uncommented ifort as expected

If you are expecting the result to be of double precision
(based on your explicit conversion of the arguments to double),
you will be disappointed.
The result is ALWAYS of SINGLE precision, UNLESS the KIND of the result
is also specified.

> !a(i) = CMPLX(2.71_8,2.71_8)
> b(i) = CMPLX(3.14_8,3.14_8)

Ditto for this line.

bernard gingold

unread,
Dec 16, 2018, 3:38:42 PM12/16/18
to
My only question was: why did the Intel compiler generate circa 180 machine code instructions for complex division?
It is true that I did not experiment with the compiler's floating-point precision switches, and what I was seeing could be
a precise and slow implementation of complex division.
Using the options -O3 and -fp-model fast=2 eliminated the x87 code, which was probably generated to enable
proper handling of potential overflow and/or catastrophic cancellation stemming from the complex division formulae.
P.S.
DCMPLX was used.
This is the result:


mov eax, DWORD PTR [-112+rbp] #19.1
movsxd rax, eax #19.1
imul rax, rax, 16 #19.1
mov edx, offset flat: test_$A #19.1
add rdx, rax #19.1
add rdx, -16 #19.1
movupd xmm0, XMMWORD PTR [rdx] #19.1
mov eax, DWORD PTR [-112+rbp] #19.1
movsxd rax, eax #19.1
imul rax, rax, 16 #19.1
mov edx, offset flat: test_$B #19.1
add rdx, rax #19.1
add rdx, -16 #19.1
movupd xmm1, XMMWORD PTR [rdx] #19.1
movapd xmm2, xmm1 #19.1
unpcklpd xmm2, xmm1 #19.1
mulpd xmm2, xmm0 #19.1
shufpd xmm0, xmm0, 1 #19.1
movapd xmm3, xmm1 #19.1
unpckhpd xmm3, xmm1 #19.1
mulpd xmm0, xmm3 #19.1
xorpd xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #19.1
addpd xmm2, xmm0 #19.1
mulpd xmm1, xmm1 #19.1
movapd xmm0, xmm1 #19.1
shufpd xmm0, xmm1, 1 #19.1
addpd xmm1, xmm0 #19.1
divpd xmm2, xmm1 #19.1
mov eax, DWORD PTR [-112+rbp] #19.13
movsxd rax, eax #19.13
imul rax, rax, 16 #19.13
mov edx, offset flat: test_$C #19.1
add rdx, rax #19.1
add rdx, -16 #19.13
movupd XMMWORD PTR [rdx], xmm2 #19.1
mov eax, 1 #20.1
add eax, DWORD PTR [-112+rbp] #20.1
mov DWORD PTR [-112+rbp], eax #20.1
mov eax, DWORD PTR [-112+rbp] #20.1
cmp eax, 64 #20.1
jle ..B1.7

Thomas Koenig

unread,
Dec 16, 2018, 4:38:48 PM12/16/18
to
bernard gingold <ben...@gmail.com> schrieb:

> My only question was -- Why Intel Compiler generated circa. 180
> machine code instructions for complex division?

I cannot speak for Intel, but complex division is complex (pun
intended).

Look at the pseudo-code generated with gfortran for a simple statement

c = a / b

where

<bb 2> :
_9 = REALPART_EXPR <*a_5(D)>;
_10 = IMAGPART_EXPR <*a_5(D)>;
_1 = COMPLEX_EXPR <_9, _10>;
_11 = REALPART_EXPR <*b_6(D)>;
_12 = IMAGPART_EXPR <*b_6(D)>;
_2 = COMPLEX_EXPR <_11, _12>;
_13 = ABS_EXPR <_11>;
_14 = ABS_EXPR <_12>;
_15 = _13 < _14;
if (_15 != 0)
goto <bb 4>; [50.00%]
else
goto <bb 5>; [50.00%]

<bb 4> :
_16 = _11 / _12;
_17 = _11 * _16;
_18 = _12 + _17;
_19 = _9 * _16;
_20 = _10 + _19;
_21 = _10 * _16;
_22 = _21 - _9;
_23 = _20 / _18;
_24 = _22 / _18;
_38 = _23;
_39 = _24;
goto <bb 3>; [100.00%]

<bb 5> :
_25 = _12 / _11;
_26 = _12 * _25;
_27 = _11 + _26;
_28 = _10 * _25;
_29 = _9 + _28;
_30 = _9 * _25;
_31 = _10 - _30;
_32 = _29 / _27;
_33 = _31 / _27;
_40 = _32;
_41 = _33;

<bb 3> :
# _36 = PHI <_38(4), _40(5)>
# _37 = PHI <_39(4), _41(5)>
_3 = COMPLEX_EXPR <_36, _37>;
_34 = _36;
_35 = _37;
REALPART_EXPR <*c_7(D)> = _34;
IMAGPART_EXPR <*c_7(D)> = _35;

(This is from the *cplxlower0 dump, which you will
get with -fdump-tree-cplxlower0 )
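
Written out as formulas, the two branches above are Smith's scaled form of complex division (numerator $a+bi$, divisor $c+di$):

\[
\frac{a+bi}{c+di} =
\begin{cases}
\dfrac{a+br}{c+dr} + \dfrac{b-ar}{c+dr}\,i, & r=\dfrac{d}{c}, \ \ |c| \ge |d|,\\[2ex]
\dfrac{ar+b}{cr+d} + \dfrac{br-a}{cr+d}\,i, & r=\dfrac{c}{d}, \ \ |c| < |d|,
\end{cases}
\]

so the denominator is never the possibly overflowing $c^2+d^2$ of the textbook formula.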

bernard gingold

unread,
Dec 17, 2018, 3:00:48 AM12/17/18
to
Yes, I fully agree with you.
It seems to me that compilers (ifort) will generate both a safe, slow implementation and an unsafe, fast(er) implementation.
The difference lies in the handling of possibly unsafe computations.
As regards gfortran, I cannot see two different implementations of complex division. The disassembly strictly follows a standard formula for complex division.

Without optimization (index calculations removed):
movapd xmm4, xmm0
divsd xmm4, xmm1
mulsd xmm0, xmm4
addsd xmm0, xmm1
movapd xmm1, xmm3
mulsd xmm1, xmm4
addsd xmm1, xmm2
mulsd xmm2, xmm4
subsd xmm2, xmm3
divsd xmm1, xmm0
divsd xmm2, xmm0
movapd xmm0, xmm2
With optimization: -O3

movapd xmm3, xmm1
divsd xmm3, xmm0
mulsd xmm1, xmm3
addsd xmm0, xmm1
movapd xmm1, xmm3
mulsd xmm1, xmm5
addsd xmm1, xmm2
mulsd xmm2, xmm3
divsd xmm1, xmm0




bernard gingold

unread,
Dec 17, 2018, 3:04:11 AM12/17/18
to
On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:
Out of curiosity I checked the implementation of complex division used for std::complex with g++, and it seems that this safe builtin function handles complex division even when -O3 is set:

COMPILER_RT_ABI double _Complex
__divdc3(double __a, double __b, double __c, double __d)
{
    int __ilogbw = 0;
    double __logbw = crt_logb(crt_fmax(crt_fabs(__c), crt_fabs(__d)));
    if (crt_isfinite(__logbw))
    {
        __ilogbw = (int)__logbw;
        __c = crt_scalbn(__c, -__ilogbw);
        __d = crt_scalbn(__d, -__ilogbw);
    }
    double __denom = __c * __c + __d * __d;
    double _Complex z;
    __real__ z = crt_scalbn((__a * __c + __b * __d) / __denom, -__ilogbw);
    __imag__ z = crt_scalbn((__b * __c - __a * __d) / __denom, -__ilogbw);
    if (crt_isnan(__real__ z) && crt_isnan(__imag__ z))
    {
        if ((__denom == 0.0) && (!crt_isnan(__a) || !crt_isnan(__b)))
        {
            __real__ z = crt_copysign(CRT_INFINITY, __c) * __a;
            __imag__ z = crt_copysign(CRT_INFINITY, __c) * __b;
        }
        else if ((crt_isinf(__a) || crt_isinf(__b)) &&
                 crt_isfinite(__c) && crt_isfinite(__d))
        {
            __a = crt_copysign(crt_isinf(__a) ? 1.0 : 0.0, __a);
            __b = crt_copysign(crt_isinf(__b) ? 1.0 : 0.0, __b);
            __real__ z = CRT_INFINITY * (__a * __c + __b * __d);
            __imag__ z = CRT_INFINITY * (__b * __c - __a * __d);
        }
        else if (crt_isinf(__logbw) && __logbw > 0.0 &&
                 crt_isfinite(__a) && crt_isfinite(__b))
        {
            __c = crt_copysign(crt_isinf(__c) ? 1.0 : 0.0, __c);
            __d = crt_copysign(crt_isinf(__d) ? 1.0 : 0.0, __d);
            __real__ z = 0.0 * (__a * __c + __b * __d);
            __imag__ z = 0.0 * (__b * __c - __a * __d);
        }
    }
    return z;
}

bernard gingold

unread,
Dec 17, 2018, 11:16:18 AM12/17/18
to
On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:
Thomas, do you plan to implement a floating-point-safe version of complex division?
As far as I was able to investigate, gfortran, at least on Intel CPUs, always generates the same machine code.
Looking at your example, I see that only a guard against zero is implemented.
Thanks in advance.

Thomas Koenig

unread,
Dec 17, 2018, 1:25:52 PM12/17/18
to
bernard gingold <ben...@gmail.com> schrieb:
> On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:

>> <bb 2> :
>> _9 = REALPART_EXPR <*a_5(D)>;
>> _10 = IMAGPART_EXPR <*a_5(D)>;
>> _1 = COMPLEX_EXPR <_9, _10>;
>> _11 = REALPART_EXPR <*b_6(D)>;
>> _12 = IMAGPART_EXPR <*b_6(D)>;
>> _2 = COMPLEX_EXPR <_11, _12>;
>> _13 = ABS_EXPR <_11>;
>> _14 = ABS_EXPR <_12>;
>> _15 = _13 < _14;
>> if (_15 != 0)
>> goto <bb 4>; [50.00%]
>> else
>> goto <bb 5>; [50.00%]

> Thomas do you plan to implement floating-point safe version of complex division?

It should be safe; if you find a test case where it isn't, please let us
know.

> As far as I was able to investigate 'gfortran' at least on Intel CPU generates always the same machine code.
> By looking at your example I see, that only guard against the zero is implemented.

The pseudocode is not easy to read. What the snippet above does is

if (abs(b%re) < abs(b%im)) then
   goto 4
else
   goto 5
end if

where labels 4 and 5 indicate which branch of the division is taken in each case.

bernard gingold

unread,
Dec 17, 2018, 3:01:15 PM12/17/18
to
I was confused by the (_15 != 0) statement, and only later did I understand that it is an absolute-value comparison between the real and imaginary parts of the divisor.
After looking at the gfortran implementation I realized that you implemented Smith's complex division algorithm: https://dl.acm.org/citation.cfm?id=368661

I was thinking about writing some tests comparing naive complex division and Smith's implementation. It should be really interesting to look for the breaking point and to test for loss of precision.
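
A minimal sketch of such a comparison, with the operands deliberately chosen so that the naive denominator c*c + d*d overflows (values and names are illustrative):

program smith_vs_naive
   implicit none
   real(8)    :: a, b, c, d, den, r, sden
   complex(8) :: naive, smith
   a = 1.0d0   ; b = 1.0d0       ! numerator  a + b*i
   c = 1.0d200 ; d = 1.0d200     ! divisor    c + d*i; exact quotient is 1.0d-200
   den   = c*c + d*d             ! overflows to +Infinity
   naive = cmplx((a*c + b*d)/den, (b*c - a*d)/den, kind=8)
   r     = d/c                   ! |c| >= |d| branch of Smith's algorithm
   sden  = c + d*r
   smith = cmplx((a + b*r)/sden, (b - a*r)/sden, kind=8)
   print *, 'naive :', naive     ! typically (0.,0.): the denominator overflowed
   print *, 'smith :', smith     ! approximately (1.0d-200, 0.)
end program smith_vs_naive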

bernard gingold

unread,
Dec 20, 2018, 3:56:52 PM12/20/18
to
On Thursday, December 13, 2018 at 10:11:45 AM UTC+1, bernard gingold wrote:
> I decided to write a set of Fortran interfaces to specific part of Intel AVX-AVX3 intrinsics.
> Vector intrinsics are wrapped in thin C layer which calls appropriate simd intrinsic:
> For your convenience I post a simple code snippet.
> BEGIN example:
>
> Compiler: icc 15
>
> C-side
>
> _m256d v256_add_pd(__m256d v1, __m256d v2) {
> return(_mm256_add_pd(v1,v2));
> }
> --------------------------------------------------------------
> Compiler: ifort 15
>
> Fortran-side
>
> type, bind(c) :: m256d
> real(c_double), dimension(0:3) :: v
> end type m256d
>
> interface
> type(m256d) function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
> use , intrinsic :: iso_c_binding
> import :: m256d
> type(m256d), intent(in) :: v1
> type(m256d), intent(in) :: v2
> end function v256_add_pd
> end interface
>
> END example
>
> The code compiles and links fine, but during the runtime interoperability between Fortran code and C code fails providing wrong result.
> During the debugging C-side assembly code (debug build) calls vaddpd machine code instruction which returns it's result in YMM0 register.
> Fortran-side compiler generated assembly code did not issue a load instruction
> to copy a content of YMM0 register to automatic variable of type m256d which should be compatible with C declaration of __m256d struct.
> C struct '__m256d' contains interoperable data member: a static array: double v[4] and struct is aligned on 32-byte boundary (defined in immintrin.h provided by Intel compiler).
> I suppose that ifort compiler failed to recognize return type of my C wrapper to Intel _mm256_add_pd intrinsic being fully compatible with derived type 'm256d'.
> Problem could stem probably from disaggregation of __m256d struct by placing its content inside YMM register.

Finally I have found a solution to my problem.
My educated guess is that the AVX data type declaration __m256d was not 'seen' by the compiler, and hence an incorrect result was computed.
The fixed implementation:

BEGIN example:

Compiler: icc 15

C-side

typedef struct __declspec(align(32)) v256d {
    double v[4];
} v256d;

v256d v256_add_pd(v256d v1, v256d v2) {
    __m256d r = _mm256_add_pd(*(__m256d*)&v1, *(__m256d*)&v2);
    return *(v256d*)&r;   /* copy the vector result back into the interoperable struct */
}
--------------------------------------------------------------
Compiler: ifort 15

Fortran-side

type, bind(c) :: v256d
real(c_double), dimension(0:3) :: v
end type v256d

interface
function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
use , intrinsic :: iso_c_binding
import :: m256d
type(v256d), intent(in) :: v1
type(v256d), intent(in) :: v2
type(v256d) :: v256_add_pd
end function v256_add_pd
end interface

END example

FortranFan

unread,
Dec 20, 2018, 7:13:13 PM12/20/18
to
On Thursday, December 20, 2018 at 3:56:52 PM UTC-5, bernard gingold wrote:

> ..
> Finally I have a found a solution to my problem.
> ..
> v256d v256_add_pd(v256d v1, v256d v2) {
..
> interface
> function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
> use , intrinsic :: iso_c_binding
> import :: m256d
> type(v256d), intent(in) :: v1
> type(v256d), intent(in) :: v2
> ..


@bernard gingold,

You ignored the question in the first response to your post on this thread.

Strictly speaking, given the C function prototype, the interface in Fortran must have the VALUE attribute applied to the 2 dummy arguments of v1 and v2 (or if the Fortran interface is to be retained as you show, the C function prototype must have the 2 parameters as pass-by-reference).

It's fortuitous, given the layout of the 'v256d' struct and the way widely used compilers such as Intel's and GCC lay out the data, that you find your code works either way. Try checking the interoperability of C and Fortran with a different struct and see what happens when the function parameters do not conform, say with

typedef struct tag_foo {
char c[4];
} foo;
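
For reference, a hedged sketch of the VALUE-attributed form of the interface from the 'fixed' post that the first paragraph above calls for (reusing the names from that post, with the C prototype left by-value):

interface
   function v256_add_pd(v1, v2) bind(c,name='v256_add_pd') result(r)
      import :: v256d
      type(v256d), value :: v1   ! VALUE matches the by-value C parameters
      type(v256d), value :: v2
      type(v256d) :: r
   end function v256_add_pd
end interface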

bernard gingold

unread,
Dec 21, 2018, 3:19:37 AM12/21/18
to
Thank you for your answer.
There is a typo in code snippet of fixed implementation -- it should have been written: import :: v256d

Now in regards to adding VALUE attribute I did reply to @Louis Krupp and here is my response:

>>>> Hi Louis,
>>> Thanks for the reply.
>>> I added a keyword 'value' unfortunately it did not help.

Adding the VALUE attribute did not help.
For the sake of clarity, the C function was declared, or rather defined, like this:
__m256d v256_add_pd(__m256d v1, __m256d v2)
At the end of execution of the C function the result of the simple vector addition was placed in the YMM0 register as expected, while the
automatic derived-type variable ended up filled with an alternating mix of 0.0E+00 values and uninitialized-stack memory patterns.
In the prologue of the C function the correct vector loads of the v1 and v2 arguments were executed, and the C function operated on proper values.

For the sake of clarity, the struct __m256d is declared in the header "immintrin.h"
like this:
typedef struct __declspec(align(32)) __m256d {
    double v[4];
} __m256d;

As you can see, the C-side struct __m256d is fully compatible with the Fortran-side derived type v256d.
At the assembly level I expected to see approximately two instructions:
a load of the address of the local v256d variable, disaggregated to a 32-byte storage block, followed by a vector load/copy of the YMM0 content into that storage area.

Eugene Epshteyn

unread,
Dec 22, 2018, 11:12:38 PM12/22/18
to
The "value" attribute may not have helped here, because "by value" arguments in C could in fact be passed by reference behind the scenes. (For example, if they don't fit into registers.)

You can see this by getting LLVM IR dump from clang compiler:

Original C code:

typedef struct __declspec(align(32)) {
double v[4];
}__m256d;

__m256d f(__m256d a, __m256d b)
{
return a ;
}

Resulting LLVM IR with optimizations turned off:

%struct.__m256d = type { [4 x double] }

; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @f(%struct.__m256d* noalias sret %agg.result, %struct.__m256d* %a, %struct.__m256d* %b) #0 {
entry:
%0 = bitcast %struct.__m256d* %agg.result to i8*
%1 = bitcast %struct.__m256d* %a to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 32 %0, i8* align 32 %1, i64 32, i1 false)
ret void
}

As you can see, the declaration of f() was changed to accept the input parameters by reference, and the result is returned by reference as well.

ga...@u.washington.edu

unread,
Dec 23, 2018, 12:53:00 AM12/23/18
to
On Saturday, December 22, 2018 at 8:12:38 PM UTC-8, Eugene Epshteyn wrote:
> The "value" attribute may not have helped here, because
> "by value" arguments in C could in fact be passed by reference
> behind the scenes. (For example, if they don't fit into registers.)

C uses pass by value. How that is implemented is up to each
compiler designer. One possible implementation is to pass
an address and have the called routine make a local copy of
the value.

Traditionally, C pushed all values on the stack.

A more recent popular calling convention puts the first few
arguments in registers, still leaving space on the stack.

A companion C compiler for a Fortran compiler must use the
same calling convention, and so it is normally appropriate
for arguments to have VALUE, independent of the underlying
mechanism. Note that C can pass pointers by value, which
looks similar to call by reference, if the pointer in
the called routine isn't changed.

If the prototype doesn't have VALUE, then Fortran will usually
pass the address of the argument. If the convention, as noted
above, passes the address, then without VALUE, it will instead
pass the address of the address.
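
A small illustration of that last point: for a hypothetical C function scale_by_two that takes a double* (the pointer itself is passed by value), the matching Fortran interface carries no VALUE attribute, so the compiler passes the address of x, which is exactly the pointer value the C side expects. A sketch, with illustrative names:

interface
   subroutine scale_by_two(x) bind(c, name='scale_by_two')
      use, intrinsic :: iso_c_binding
      real(c_double), intent(inout) :: x   ! no VALUE: an address is passed
   end subroutine scale_by_two
end interface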

bernard gingold

unread,
Dec 23, 2018, 3:35:51 AM12/23/18
to
@Eugene Epshteyn

Yes, exactly.
In my case adding the value attribute caused random crashes.
By investigating the assembly representation of the machine code it became clear to me that, behind the curtain, the compiler disaggregated the struct to its member (an array), and the array type decayed to a pointer to its first element.

Looking at the LLVM IR code, I see that the call to memcpy is performed on a 32-byte storage block. I suppose this may be an unoptimized stage of the code transformation, and that later a 'vmovupd' machine instruction will be emitted.

bernard gingold

unread,
Dec 23, 2018, 3:41:43 AM12/23/18
to
What I see in the prologue of the C-side function, or rather its machine code implementation, is an address being passed by RIP-relative offset to the disaggregated derived type (the address of its first double-precision element). I presume the 32-byte storage blocks are located in the .text section because their initializers are compile-time constants.

ga...@u.washington.edu

unread,
Dec 23, 2018, 6:19:05 AM12/23/18
to
On Saturday, December 22, 2018 at 8:12:38 PM UTC-8, Eugene Epshteyn wrote:
I am not so sure I understand it, but as far as I know the
arguments are still passed by value.

For functions returning a struct, though, it is usual for one
bigger than a register that the caller passes the address of a
buffer for the callee to store the result.

If the return value fits in a register, then it is returned
in an appropriate register.

For 32 bit CDECL, a 64 bit result is returned in EDX:EAX.

I believe for a small struct, it is still in a register, but for
larger ones an address is passed.




bernard gingold

unread,
Dec 23, 2018, 11:25:38 AM12/23/18
to
This is how it looks in assembly.


void v256_add_pd(double * __restrict a, double * __restrict b, double * __restrict c) {

_mm256_store_pd(&c[0], _mm256_add_pd(*(__m256d*)&a[0], *(__m256d*)&b[0]));
}
// Fortran Interface
interface
subroutine v256_add_pd(a,b,c) bind(c,name='v256_add_pd')
use, intrinsic :: ISO_C_BINDING
real(c_double), dimension(4), intent(in) :: a
real(c_double), dimension(4), intent(in) :: b
real(c_double), dimension(4), intent(out) :: c
end subroutine
end interface

000007F7B7A611A0 C5 FD 10 01 vmovupd ymm0,ymmword ptr [rcx]
000007F7B7A611A4 C5 FD 58 0A vaddpd ymm1,ymm0,ymmword ptr [rdx]
000007F7B7A611A8 C4 C1 7D 11 08 vmovupd ymmword ptr [r8],ymm1

000007F7B7A611AD C5 F8 77 vzeroupper
000007F7B7A611B0 C3 ret

==============================================================================================================================

v256d v256_add_pd2(v256d a, v256d b) {
return (*(v256d*)&(_mm256_add_pd(*(__m256d*)&a,*(__m256d*)&b)));
}
// Fortran Interface
interface
function v256_add_pd2(a,b) bind(c,name='v256_add_pd2')
use, intrinsic :: ISO_C_BINDING
import :: v256d

type(v256d), intent(in) :: a
type(v256d), intent(in) :: b
type(v256d) :: v256_add_pd2
end function
end interface

000007F6B6261105 48 8D 15 B4 40 00 00 lea rdx,[__security_cookie_complement+1A8h (07F6B62651C0h)]
000007F6B626110C 49 8D 4D 00 lea rcx,[r13]
000007F6B6261110 4C 8D 05 E9 40 00 00 lea r8,[__security_cookie_complement+1E8h (07F6B6265200h)]
000007F6B6261117 E8 D4 00 00 00 call v256_add_pd2 (07F6B62611F0h)
000007F6B62611F0 48 83 EC 78 sub rsp,78h

000007F6B62611F4 48 89 C8 mov rax,rcx

000007F6B62611F7 4C 89 6C 24 60 mov qword ptr [rsp+60h],r13
000007F6B62611FC 4C 8D 6C 24 3F lea r13,[rsp+3Fh]
21:
000007F6B6261201 C5 FD 10 02 vmovupd ymm0,ymmword ptr [rdx]
1:
2:

20:
000007F6B6261205 49 83 E5 E0 and r13,0FFFFFFFFFFFFFFE0h
21:
000007F6B6261209 C4 C1 7D 58 08 vaddpd ymm1,ymm0,ymmword ptr [r8]
000007F6B626120E C4 C1 7D 11 4D 00 vmovupd ymmword ptr [r13],ymm1
000007F6B6261214 C4 C1 78 10 55 10 vmovups xmm2,xmmword ptr [r13+10h]
000007F6B626121A C4 C1 78 10 5D 00 vmovups xmm3,xmmword ptr [r13]
000007F6B6261220 C5 F8 11 51 10 vmovups xmmword ptr [rcx+10h],xmm2
000007F6B6261225 C5 F8 11 19 vmovups xmmword ptr [rcx],xmm3
000007F6B6261229 4C 8B 6C 24 60 mov r13,qword ptr [rsp+60h]
000007F6B626122E C5 F8 77 vzeroupper
000007F6B6261231 48 83 C4 78 add rsp,78h
000007F6B6261235 C3 ret

FortranFan

unread,
Dec 23, 2018, 12:06:09 PM12/23/18
to
On Sunday, December 23, 2018 at 3:35:51 AM UTC-5, bernard gingold wrote:

> On Sunday, December 23, 2018 at 5:12:38 AM UTC+1, Eugene Epshteyn wrote:
> > The "value" attribute may not have helped here, because "by value" arguments in C could in fact be passed by reference behind the scenes. (For example, if they don't fit into registers.)
> >
> ..
> @Eugene Epshteyn
>
> Yes exactly.
> ..

@bernard gingold,

It was just a suggestion to you to make the interface in Fortran consistent with the prototype in C which should work alright in the *general case*.

Say you have C code as follows:

#define N 4

typedef struct _foo_t {
    int x;
    char c[N];
} foo_t;

foo_t add_foo(foo_t f1, foo_t f2) {
    foo_t foo;
    for (int i=0; i<N; i++) foo.c[i] = f1.c[i];
    foo.x = f1.x + f2.x;
    return foo;
}

Now try your approach to C interop, per your original post or the corrected one, both of which are *without* the VALUE attribute in the Fortran interface (in this case for the 'add_foo' function), and see what happens.

bernard gingold

unread,
Dec 23, 2018, 1:27:33 PM12/23/18
to
@FortranFan

This is how a call to the C-side function looks. You can see that the Fortran interface is declared without the 'VALUE' attribute, yet the compiler disaggregated (that is the proper name) the derived type v256d to its array member, and the base address (array-to-pointer decay) of that array member was passed to the v256_add_pd2
function.

000007F6B6261105 48 8D 15 B4 40 00 00 lea rdx,[__security_cookie_complement+1A8h (07F6B62651C0h)] <-- Base address of array (array-to-pointer decay).
000007F6B626110C 49 8D 4D 00 lea rcx,[r13] <-- Base address of array (array-to-pointer decay) where a return value is placed.
000007F6B6261110 4C 8D 05 E9 40 00 00 lea r8,[__security_cookie_complement+1E8h (07F6B6265200h)] <-- Base address of array (array-to-pointer decay)

bernard gingold

unread,
Dec 23, 2018, 1:38:30 PM12/23/18
to
@FortranFan
This is a follow-up message.
I came to the conclusion that the better approach, which results in more compact machine code, is to declare this:
Please see below:

void v256_add_pd(double * __restrict a, double * __restrict b, double * __restrict c) {

_mm256_store_pd(&c[0], _mm256_add_pd(*(__m256d*)&a[0], *(__m256d*)&b[0]));
}
// Fortran Interface
interface
subroutine v256_add_pd(a,b,c) bind(c,name='v256_add_pd')
use, intrinsic :: ISO_C_BINDING
real(c_double), dimension(4), intent(in) :: a
real(c_double), dimension(4), intent(in) :: b
real(c_double), dimension(4), intent(out) :: c
end subroutine
end interface

And the corresponding machine code implementation is much shorter than
that of the struct-returning function approach.

ga...@u.washington.edu

unread,
Dec 24, 2018, 9:07:57 AM12/24/18
to
On Sunday, December 23, 2018 at 10:27:33 AM UTC-8, bernard gingold wrote:


(snip)

> This is how does it look a call to C-side function. You may
> see that Fortran interface is declared without 'VALUE' attribute,
> yet compiler disaggregated (that is the proper name) derived type
> v256d to its array member and base address(array-to-pointer decay)
> of array member was passed to v256_add_pd2 function.

Reminds me, in K&R C (before the ANSI standard) only pointers to
struct could be used as function arguments, not a struct itself.
(That is, only a reference, not a value.)

In many cases, I still believe that is the best way.

The OP will probably have to try different ways to see which one
works best, but passing the pointer might mean not storing the
value on the stack first.

My usual way to do these is to compile a simple C program, look
at the generated assembly code, and then write assembly code with
the same register interface.

bernard gingold

unread,
Dec 26, 2018, 5:35:47 AM12/26/18
to
I made a decision to use the void function/subroutine approach extensively.
The rationale behind this decision: only 3 machine code instructions (not counting vzeroupper and ret), compared to ~15 machine code instructions in the case of the struct-returning function.

In the case of the struct-returning function the compiler creates an unnecessary, probably temporary, copy of the result, generating an unneeded memory transfer of the halved vector register. From my point of view this is unacceptable.
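
For completeness, a minimal sketch of a Fortran caller for the chosen subroutine-style wrapper (the interface repeats the one posted earlier in the thread; the C side is assumed to be compiled and linked separately):

program demo_v256_add
   use, intrinsic :: iso_c_binding
   implicit none
   interface
      subroutine v256_add_pd(a, b, c) bind(c, name='v256_add_pd')
         import :: c_double
         real(c_double), dimension(4), intent(in)  :: a
         real(c_double), dimension(4), intent(in)  :: b
         real(c_double), dimension(4), intent(out) :: c
      end subroutine v256_add_pd
   end interface
   !DIR$ ATTRIBUTES ALIGN : 32 :: a, b, c
   real(c_double), dimension(4) :: a, b, c
   a = 3.14_c_double
   b = 2.71_c_double
   call v256_add_pd(a, b, c)     ! expect 5.85 in all four lanes
   print *, c
end program demo_v256_add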
