Is FMA/Muladd Working Here?


Chris Rackauckas

Sep 21, 2016, 1:56:45 AM
to julia-users
Hi,
  First of all, does LLVM automatically apply fma or muladd to expressions like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or does one have to write `muladd` and `fma` explicitly for these kinds of expressions (is there a macro for making this easier)?

  Secondly, I am wondering if my setup is not applying these operations correctly. Here's my test code:

f(x) = 2.0x + 3.0
g(x) = muladd(x,2.0, 3.0)
h(x) = fma(x,2.0, 3.0)

@code_llvm f(4.0)
@code_llvm g(4.0)
@code_llvm h(4.0)

@code_native f(4.0)
@code_native g(4.0)
@code_native h(4.0)

Computer 1

Julia Version 0.5.0-rc4+0
Commit 9c76c3e* (2016-09-09 01:43 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

(the COPR nightly on CentOS7) with 

[crackauc@crackauc2 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Stepping:              1
CPU MHz:               1200.000
BogoMIPS:              6392.58
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15



I get the output

define double @julia_f_72025(double) #0 {
top:
  %1 = fmul double %0, 2.000000e+00
  %2 = fadd double %1, 3.000000e+00
  ret double %2
}

define double @julia_g_72027(double) #0 {
top:
  %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
  ret double %1
}

define double @julia_h_72029(double) #0 {
top:
  %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
  ret double %1
}
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
Source line: 1
addsd %xmm0, %xmm0
movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
Source line: 2
addsd %xmm0, %xmm0
movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: fmatest.jl
pushq %rbp
movq %rsp, %rbp
movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
Source line: 3
movsd (%rax), %xmm1           # xmm1 = mem[0],zero
movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
movsd (%rax), %xmm2           # xmm2 = mem[0],zero
movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
popq %rbp
jmpq *%rax
nopl (%rax)

It looks like the plain expression and the explicit muladd end up as the same native code, but is that native code actually performing an fma? The fma version's native code is different, but from a discussion on Gitter it seems that it might be a software FMA (a library call)? This computer's BIOS is set to some "LAPACK optimized" mode or something like that, so could that be interfering with something?

Computer 2

Julia Version 0.6.0-dev.557
Commit c7a4897 (2016-09-08 17:50 UTC)
Platform Info:
  System: NT (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)


on a 4770k i7, Windows 10, I get the output

; Function Attrs: uwtable
define double @julia_f_66153(double) #0 {
top:
  %1 = fmul double %0, 2.000000e+00
  %2 = fadd double %1, 3.000000e+00
  ret double %2
}

; Function Attrs: uwtable
define double @julia_g_66157(double) #0 {
top:
  %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
  ret double %1
}

; Function Attrs: uwtable
define double @julia_h_66158(double) #0 {
top:
  %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
  ret double %1
}
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
Source line: 1
addsd %xmm0, %xmm0
movabsq $534749456, %rax        # imm = 0x1FDFA110
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
Source line: 2
addsd %xmm0, %xmm0
movabsq $534749584, %rax        # imm = 0x1FDFA190
addsd (%rax), %xmm0
popq %rbp
retq
nopl (%rax,%rax)
.text
Filename: console
pushq %rbp
movq %rsp, %rbp
movabsq $534749712, %rax        # imm = 0x1FDFA210
Source line: 3
movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
movabsq $534749720, %rax        # imm = 0x1FDFA218
movsd (%rax), %xmm2           # xmm2 = mem[0],zero
movabsq $fma, %rax
popq %rbp
jmpq *%rax
nop

This seems to be similar to the first result.

Erik Schnetter

Sep 21, 2016, 9:22:06 AM
to julia...@googlegroups.com
On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rack...@gmail.com> wrote:
Hi,
  First of all, does LLVM automatically apply fma or muladd to expressions like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or does one have to write `muladd` and `fma` explicitly for these kinds of expressions (is there a macro for making this easier)?

Yes, LLVM will use fma machine instructions -- but only if they lead to the same round-off error as using separate multiply and add instructions. If you do not care about the details of conforming to the IEEE standard, then you can use the `@fastmath` macro that enables several optimizations, including this one. This is described in the manual <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.


  Secondly, I am wondering if my setup is not applying these operations correctly. Here's my test code:

f(x) = 2.0x + 3.0
g(x) = muladd(x,2.0, 3.0)
h(x) = fma(x,2.0, 3.0)

@code_llvm f(4.0)
@code_llvm g(4.0)
@code_llvm h(4.0)

@code_native f(4.0)
@code_native g(4.0)
@code_native h(4.0)

Computer 1

Julia Version 0.5.0-rc4+0
Commit 9c76c3e* (2016-09-09 01:43 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

This looks good; the "broadwell" architecture that LLVM reports should imply the respective optimizations. Try with `@fastmath`.

-erik

Páll Haraldsson

Sep 21, 2016, 9:44:48 AM
to julia-users

On Wednesday, September 21, 2016 at 5:56:45 AM UTC, Chris Rackauckas wrote:
Julia Version 0.5.0-rc4+0
 
I'm not saying it matters here, but is this version known to be identical to the released 0.5? Unless you know it is, bugs should in general be reported against the latest version.

--
Palli.

Simon Byrne

Sep 21, 2016, 12:00:02 PM
to julia-users
On Wednesday, 21 September 2016 06:56:45 UTC+1, Chris Rackauckas wrote:
Hi,
  First of all, does LLVM automatically apply fma or muladd to expressions like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or does one have to write `muladd` and `fma` explicitly for these kinds of expressions (is there a macro for making this easier)?

You will generally need to use muladd, unless you use @fastmath.

 
  Secondly, I am wondering if my setup is not applying these operations correctly. Here's my test code:

If you're using the prebuilt downloads (as opposed to building from source), you will need to rebuild the sysimg (see contrib/build_sysimg.jl), since we build for the lowest common denominator architecture.

-Simon

Chris Rackauckas

Sep 21, 2016, 2:36:04 PM
to julia-users
The Windows one is using the pre-built binary. The Linux one uses the COPR nightly (I assume that should build with all the goodies?)

Milan Bouchet-Valat

Sep 21, 2016, 3:11:34 PM
to julia...@googlegroups.com
On Wednesday, September 21, 2016 at 11:36 -0700, Chris Rackauckas wrote:
> The Windows one is using the pre-built binary. The Linux one uses the
> COPR nightly (I assume that should build with all the goodies?)
The Copr RPMs are subject to the same constraint as the official binaries:
we need them to work on most machines, so they don't enable FMA (nor
e.g. AVX) either.

It would be nice to find a way to ship several pre-built sysimg
files and use the highest instruction set supported by your CPU.


Regards

Chris Rackauckas

Sep 21, 2016, 3:15:41 PM
to julia-users
I see. So what I'm getting is that, in my code:

1. I will need to add @fastmath anywhere I want these optimizations to show up. That should be easy, since I can just add it at the beginning of the loops where I already have @inbounds, which covers every major inner loop I have. An easy find/replace fix.

2. For my own setup, I am going to need to build from source to get all the optimizations? I would've thought the point of using the Linux repositories instead of the generic binaries was that they would be set up to build for your system. That's just a non-expert's misconception, I guess? I think that should be highlighted somewhere.

Milan Bouchet-Valat

Sep 21, 2016, 4:15:57 PM
to julia...@googlegroups.com
On Wednesday, September 21, 2016 at 12:15 -0700, Chris Rackauckas wrote:
> I see. So what I am getting is that, in my codes, 
>
> 1. I will need to add @fastmath anywhere I want these optimizations
> to show up. That should be easy since I can just add it at the
> beginnings of loops where I have @inbounds which already covers every
> major inner loop I have. Easy find/replace fix. 
>
> 2. For my own setup, I am going to need to build from source to get
> all the optimizations? I would've though the point of using the Linux
> repositories instead of the generic binaries is that they would be
> setup to build for your system. That's just a non-expert's
> misconception I guess? I think that should be highlighted somewhere.
No, the point of using Linux packages is to integrate easily with the
rest of the system (e.g. automated updates, installation in the path
without manual tweaking), and to use your distribution's libraries to
avoid duplication.

That's just how it works for any software in a distribution. You need
to use Gentoo if you want software to be tuned at build time to your
particular system.


Regards

Chris Rackauckas

Sep 21, 2016, 9:22:45 PM
to julia-users
I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now I get results where g and h apply muladd/fma in the native code, but a new function k, which is f with `@fastmath` inside, does not apply muladd/fma.

https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910

Should I open an issue?

Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding for some reason, so I may need to just build from source.

Erik Schnetter

Sep 21, 2016, 9:29:44 PM
to julia...@googlegroups.com
On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rack...@gmail.com> wrote:
I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now I get results where g and h apply muladd/fma in the native code, but a new function k, which is f with `@fastmath` inside, does not apply muladd/fma.


In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate than `2x + 3`. If you use a less round number than `2` to multiply `x`, you might see different behaviour.

-erik

Chris Rackauckas

Sep 21, 2016, 9:32:26 PM
to julia-users
Still no FMA?

julia> k(x) = @fastmath 2.4x + 3.0
WARNING: Method definition k(Any) in module Main at REPL[14]:1 overwritten at REPL[23]:1.
k (generic function with 1 method)

julia> @code_llvm k(4.0)

; Function Attrs: uwtable
define double @julia_k_66737(double) #0 {
top:
  %1 = fmul fast double %0, 2.400000e+00
  %2 = fadd fast double %1, 3.000000e+00
  ret double %2
}

julia> @code_native k(4.0)
        .text
Filename: REPL[23]
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $568231032, %rax        # imm = 0x21DE8478
Source line: 1
        vmulsd  (%rax), %xmm0, %xmm0
        movabsq $568231040, %rax        # imm = 0x21DE8480
        vaddsd  (%rax), %xmm0, %xmm0
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

Yichao Yu

Sep 21, 2016, 9:33:25 PM
to Julia Users
On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter <schn...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rack...@gmail.com>
> wrote:
>>
>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
>> I get results where g and h apply muladd/fma in the native code, but a new
>> function k which is `@fastmath` inside of f does not apply muladd/fma.
>>
>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>
>> Should I open an issue?
>
>
> In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
> than `2x+3`. If you use a less round number than `2` multiplying `x`, you
> might see a different behaviour.

I've personally never seen LLVM create an fma from a mul and an add. We might
not have the relevant LLVM passes enabled, if LLVM is capable of doing this at
all.

Yichao Yu

Sep 21, 2016, 9:46:07 PM
to Julia Users
On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu <yyc...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter <schn...@gmail.com> wrote:
>> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rack...@gmail.com>
>> wrote:
>>>
>>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg and now
>>> I get results where g and h apply muladd/fma in the native code, but a new
>>> function k which is `@fastmath` inside of f does not apply muladd/fma.
>>>
>>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>>
>>> Should I open an issue?
>>
>>
>> In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
>> than `2x+3`. If you use a less round number than `2` multiplying `x`, you
>> might see a different behaviour.
>
> I've personally never seen llvm create fma from mul and add. We might
> not have the llvm passes enabled if LLVM is capable of doing this at
> all.

Interestingly, both clang and gcc keep the mul and add with `-Ofast
-ffast-math -mavx2`, and make it an fma with `-mavx512f`. This is true
even when the call is in a loop (since switching between SSE and AVX
is costly), so I'd say either the compilers are right that the fma
instruction gives no speed advantage in this case, or it's an LLVM/GCC
missed optimization rather than a Julia one.

Erik Schnetter

Sep 21, 2016, 9:50:14 PM
to julia...@googlegroups.com
I confirm that I can't get Julia to synthesize a `vfmadd` instruction either... Sorry for sending you on a wild goose chase.

-erik

Yichao Yu

Sep 21, 2016, 10:11:14 PM
to Julia Users
On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter <schn...@gmail.com> wrote:
> I confirm that I can't get Julia to synthesize a `vfmadd` instruction
> either... Sorry for sending you on a wild goose chase.

-march=haswell does the trick for C (both clang and gcc).
The necessary bits for the machine IR optimization (this is not an LLVM
IR optimization pass) to do this are the llc option `-mcpu=haswell` and
the function attribute `unsafe-fp-math=true`.

Chris Rackauckas

Sep 22, 2016, 7:46:13 PM
to julia-users
So, in the end, is `@fastmath` supposed to be adding FMA? Should I open an issue?

Erik Schnetter

Sep 23, 2016, 3:12:22 PM
to julia...@googlegroups.com
It should. Yes, please open an issue.

-erik