Do I have simd?


DNF

Nov 5, 2015, 9:12:22 AM
to julia-users
I have been looking through the performance tips section of the manual. Specifically, I am curious about @simd (http://docs.julialang.org/en/release-0.4/manual/performance-tips/#performance-annotations).

When I cut and paste the code demonstrating the @simd macro, I don't get substantial speedups. Before updating from OSX Yosemite to El Capitan, I saw no speedup whatsoever. After the update, there is a small speedup (I ran the example repeatedly):

julia> timeit(1000,1000)
GFlop        = 1.2292170133468385
GFlop (SIMD) = 1.5351220575547964


This contrasts sharply with the example in the documentation, which shows a speedup from 1.95 GFlop to 17.6 GFlop.
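For anyone following along, the benchmark in question looks roughly like this (reconstructed from the linked 0.4 performance-tips page; if in doubt, use the manual's exact version):

```julia
function inner(x, y)
    s = zero(eltype(x))
    for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    s
end

function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)   # allow reassociation so the loop can vectorize
        @inbounds s += x[i] * y[i]
    end
    s
end

function timeit(n, reps)
    x = rand(Float32, n)
    y = rand(Float32, n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x, y)
    end
    println("GFlop        = ", 2n * reps / time * 1e-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop (SIMD) = ", 2n * reps / time * 1e-9)
end
```

Running `timeit(1000, 1000)` then produces the two GFlop lines shown above.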

Does my computer not have simd? How can I tell?

This is my versioninfo:

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Yichao Yu

Nov 5, 2015, 10:07:05 AM
to Julia Users
You can check with `code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})`.

DNF

Nov 5, 2015, 2:33:41 PM
to julia-users
On Thursday, November 5, 2015 at 4:07:05 PM UTC+1, Yichao Yu wrote:
You can check with `code_llvm(innersimd,
Tuple{Vector{Float32},Vector{Float32}})`

I tried it and got this output, but I don't know how to make sense of it:

julia> code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_innersimd_21674(%jl_value_t*, %jl_value_t*) {
L:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = load %jl_value_t** @jl_overflow_exception
  %11 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %9, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  %13 = xor i1 %12, true
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %L
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass:                                             ; preds = %L
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %14, i64 1)
  %16 = extractvalue { i64, i1 } %15, 1
  %17 = xor i1 %16, true
  br i1 %17, label %pass2, label %fail1

fail1:                                            ; preds = %pass
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %18 = extractvalue { i64, i1 } %15, 0
  %19 = icmp slt i64 0, %18
  %20 = xor i1 %19, true
  br i1 %20, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %pass2
  %sunkaddr = ptrtoint %jl_value_t* %0 to i64
  %sunkaddr19 = inttoptr i64 %sunkaddr to i8**
  %21 = load i8** %sunkaddr19
  %sunkaddr20 = ptrtoint %jl_value_t* %1 to i64
  %sunkaddr21 = inttoptr i64 %sunkaddr20 to i8**
  %22 = load i8** %sunkaddr21
  br label %L5

L5:                                               ; preds = %L5, %L5.preheader
  %lsr.iv16 = phi i8* [ %22, %L5.preheader ], [ %scevgep17, %L5 ]
  %lsr.iv = phi i8* [ %21, %L5.preheader ], [ %scevgep, %L5 ]
  %"##i#7153.0" = phi i64 [ %27, %L5 ], [ 0, %L5.preheader ]
  %s.1 = phi float [ %26, %L5 ], [ 0.000000e+00, %L5.preheader ]
  %lsr.iv1618 = bitcast i8* %lsr.iv16 to float*
  %lsr.iv15 = bitcast i8* %lsr.iv to float*
  %23 = load float* %lsr.iv15
  %24 = load float* %lsr.iv1618
  %25 = fmul float %23, %24
  %26 = fadd fast float %s.1, %25
  %27 = add i64 %"##i#7153.0", 1
  %scevgep = getelementptr i8* %lsr.iv, i64 4
  %scevgep17 = getelementptr i8* %lsr.iv16, i64 4
  %28 = icmp slt i64 %27, %18
  br i1 %28, label %L5, label %L11

L11:                                              ; preds = %L5, %pass2
  %s.3 = phi float [ 0.000000e+00, %pass2 ], [ %26, %L5 ]
  ret float %s.3
}

Kristoffer Carlsson

Nov 5, 2015, 4:14:32 PM
to julia-users
If it were compiled with SIMD instructions, it would have a vector body, which yours doesn't seem to have.
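Rather than eyeballing the whole dump, you can capture the IR as a string and search for the markers of a vectorized loop (`vector.body` and `<N x float>` types). This is a sketch; it redefines a toy `innersimd` so it is self-contained:

```julia
using InteractiveUtils   # needed on Julia >= 0.7; code_llvm is in Base on 0.4

# Toy reduction equivalent to the manual's innersimd, redefined here so the
# snippet stands alone.
function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    s
end

# Capture the IR that code_llvm would print, then search it.
ir = sprint(io -> code_llvm(io, innersimd, Tuple{Vector{Float32},Vector{Float32}}))
vectorized = contains(ir, "vector.body") || contains(ir, "x float>")
println(vectorized ? "looks vectorized" : "no vector body found")
```

Whether the second line prints "looks vectorized" depends on your build and CPU, which is exactly what this thread is trying to diagnose.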

DNF

Nov 5, 2015, 4:22:28 PM
to julia-users
I see. Do you know if I need to install something to get SIMD support?

According to this review of my computer model: "Haswell chips also include new instructions enhancing SIMD vector processing with Advanced Vector Extensions 2".

So what could be wrong?
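One thing worth ruling out first is the hardware itself. On OS X the kernel reports the CPU's SIMD feature flags via sysctl (the key names here are from memory and may vary slightly between OS versions; AVX shows up in the base feature list and AVX2 in the leaf-7 list on a Haswell chip):

```shell
# OS X: list any AVX-family flags the CPU reports
sysctl -n machdep.cpu.features machdep.cpu.leaf7_features 2>/dev/null | tr ' ' '\n' | grep -i avx || true

# Linux equivalent, for comparison
grep -i -o -m1 'avx2' /proc/cpuinfo || true
```

If AVX/AVX2 show up here, the CPU is fine and the problem is in the software stack.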

Benjamin Deonovic

Nov 5, 2015, 6:09:30 PM
to julia-users
Did you compile julia from source or just grab a pre-compiled binary?

DNF

Nov 5, 2015, 6:15:47 PM
to julia-users
I installed using Homebrew from here: https://github.com/staticfloat/homebrew-julia

I have limited understanding of the process, but believe there is some compilation involved.

Giuseppe Ragusa

Nov 6, 2015, 6:20:38 AM
to julia-users
I am pretty sure it must be something specific to your installation. On my machine

```
Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; RELEASE_X86_64 x86_64
```

running the code, I get the following timings:

```
julia> timeit(1000,1000)
GFlop        = 2.4503017546610866
GFlop (SIMD) = 11.622906423980382
```

DNF

Nov 6, 2015, 6:53:42 AM
to julia-users
On Friday, November 6, 2015 at 12:20:38 PM UTC+1, Giuseppe Ragusa wrote:
I am pretty sure it must be something specific to your installation.

Do you mean my Julia installation? 

Rob J. Goedman

Nov 6, 2015, 11:02:38 AM
to julia...@googlegroups.com
Hi DNF,

I get the results below on Julia 0.5 (home-built) and Julia 0.4 (downloaded).

A clear difference is the presence of a vector block in the output of `code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})`.

Regards,
Rob
Julia Version 0.5.0-dev+1158
Commit 20786d2* (2015-11-05 14:13 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 2.4158674171961443
GFlop (SIMD) = 14.63560990245366

Second call to timeit(1000,1000):

GFlop        = 2.3526062760477626
GFlop (SIMD) = 16.769379113738314


define float @julia_innersimd_23136(%jl_value_t*, %jl_value_t*) {
L:
  %2 = getelementptr inbounds %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64* %3, align 8
  %5 = icmp sgt i64 %4, 0
  %6 = select i1 %5, i64 %4, i64 0
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %6, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw(%jl_value_t* %9)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw(%jl_value_t* %13)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = bitcast %jl_value_t* %1 to i8**
  %17 = bitcast %jl_value_t* %0 to i8**
  %18 = load i8** %17, align 8
  %19 = load i8** %16, align 8
  %n.mod.vf = urem i64 %14, 24
  %cmp.zero = icmp eq i64 %14, %n.mod.vf
  br i1 %cmp.zero, label %middle.block, label %vector.ph

vector.ph:                                        ; preds = %if3
  %n.vec = sub i64 %14, %n.mod.vf
  %20 = sub i64 %n.mod.vf, %14
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %lsr.iv41 = phi i64 [ %lsr.iv.next42, %vector.body ], [ 0, %vector.ph ]
  %vec.phi = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %29, %vector.body ]
  %vec.phi12 = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %30, %vector.body ]
  %vec.phi13 = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %31, %vector.body ]
  %21 = mul i64 %lsr.iv41, -4
  %uglygep60 = getelementptr i8* %18, i64 %21
  %uglygep6061 = bitcast i8* %uglygep60 to <8 x float>*
  %wide.load = load <8 x float>* %uglygep6061, align 4
  %22 = mul i64 %lsr.iv41, -4
  %sunkaddr = ptrtoint i8* %18 to i64
  %sunkaddr62 = add i64 %sunkaddr, %22
  %sunkaddr63 = add i64 %sunkaddr62, 32
  %sunkaddr64 = inttoptr i64 %sunkaddr63 to <8 x float>*
  %wide.load16 = load <8 x float>* %sunkaddr64, align 4
  %23 = mul i64 %lsr.iv41, -4
  %sunkaddr65 = ptrtoint i8* %18 to i64
  %sunkaddr66 = add i64 %sunkaddr65, %23
  %sunkaddr67 = add i64 %sunkaddr66, 64
  %sunkaddr68 = inttoptr i64 %sunkaddr67 to <8 x float>*
  %wide.load17 = load <8 x float>* %sunkaddr68, align 4
  %24 = mul i64 %lsr.iv41, -4
  %uglygep = getelementptr i8* %19, i64 %24
  %uglygep43 = bitcast i8* %uglygep to <8 x float>*
  %wide.load18 = load <8 x float>* %uglygep43, align 4
  %sunkaddr69 = ptrtoint i8* %19 to i64
  %sunkaddr70 = add i64 %sunkaddr69, %24
  %sunkaddr71 = add i64 %sunkaddr70, 32
  %sunkaddr72 = inttoptr i64 %sunkaddr71 to <8 x float>*
  %wide.load19 = load <8 x float>* %sunkaddr72, align 4
  %25 = mul i64 %lsr.iv41, -4
  %sunkaddr73 = ptrtoint i8* %19 to i64
  %sunkaddr74 = add i64 %sunkaddr73, %25
  %sunkaddr75 = add i64 %sunkaddr74, 64
  %sunkaddr76 = inttoptr i64 %sunkaddr75 to <8 x float>*
  %wide.load20 = load <8 x float>* %sunkaddr76, align 4
  %26 = fmul <8 x float> %wide.load, %wide.load18
  %27 = fmul <8 x float> %wide.load16, %wide.load19
  %28 = fmul <8 x float> %wide.load17, %wide.load20
  %29 = fadd <8 x float> %vec.phi, %26
  %30 = fadd <8 x float> %vec.phi12, %27
  %31 = fadd <8 x float> %vec.phi13, %28
  %lsr.iv.next42 = add i64 %lsr.iv41, -24
  %32 = icmp eq i64 %20, %lsr.iv.next42
  br i1 %32, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <8 x float> [ zeroinitializer, %if3 ], [ %29, %vector.body ]
  %rdx.vec.exit.phi23 = phi <8 x float> [ zeroinitializer, %if3 ], [ %30, %vector.body ]
  %rdx.vec.exit.phi24 = phi <8 x float> [ zeroinitializer, %if3 ], [ %31, %vector.body ]
  %bin.rdx = fadd <8 x float> %rdx.vec.exit.phi23, %rdx.vec.exit.phi
  %bin.rdx25 = fadd <8 x float> %rdx.vec.exit.phi24, %bin.rdx
  %rdx.shuf = shufflevector <8 x float> %bin.rdx25, <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx26 = fadd <8 x float> %bin.rdx25, %rdx.shuf
  %rdx.shuf27 = shufflevector <8 x float> %bin.rdx26, <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx28 = fadd <8 x float> %bin.rdx26, %rdx.shuf27
  %rdx.shuf29 = shufflevector <8 x float> %bin.rdx28, <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx30 = fadd <8 x float> %bin.rdx28, %rdx.shuf29
  %33 = extractelement <8 x float> %bin.rdx30, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %middle.block
  %34 = mul i64 %resume.val, 4
  %scevgep = getelementptr i8* %19, i64 %34
  %scevgep36 = getelementptr i8* %18, i64 %34

<SNIPPED>

-----------------------------------------------------------------------------------------------

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 2.552215131317849
GFlop (SIMD) = 13.911108019753776

Second call to timeit(1000,1000):

GFlop        = 2.553179538308544
GFlop (SIMD) = 14.156285390713476


define float @julia_innersimd_24595(%jl_value_t*, %jl_value_t*) {
L:
  %2 = getelementptr inbounds %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64* %3, align 8
  %5 = icmp sgt i64 %4, 0
  %6 = select i1 %5, i64 %4, i64 0
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %6, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = bitcast %jl_value_t* %1 to i8**
  %17 = bitcast %jl_value_t* %0 to i8**
  %18 = load i8** %17, align 8
  %19 = load i8** %16, align 8
  %n.vec = and i64 %14, -8
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body.preheader

vector.body.preheader:                            ; preds = %if3
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.body.preheader
  %lsr.iv32 = phi i64 [ 0, %vector.body.preheader ], [ %lsr.iv.next33, %vector.body ]
  %vec.phi = phi <4 x float> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %vec.phi13 = phi <4 x float> [ %26, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %20 = mul i64 %lsr.iv32, -4
  %uglygep43 = getelementptr i8* %18, i64 %20
  %uglygep4344 = bitcast i8* %uglygep43 to <4 x float>*
  %wide.load = load <4 x float>* %uglygep4344, align 4
  %21 = mul i64 %lsr.iv32, -4
  %sunkaddr = ptrtoint i8* %18 to i64
  %sunkaddr45 = add i64 %sunkaddr, %21
  %sunkaddr46 = add i64 %sunkaddr45, 16
  %sunkaddr47 = inttoptr i64 %sunkaddr46 to <4 x float>*
  %wide.load14 = load <4 x float>* %sunkaddr47, align 4
  %22 = mul i64 %lsr.iv32, -4
  %uglygep = getelementptr i8* %19, i64 %22
  %uglygep34 = bitcast i8* %uglygep to <4 x float>*
  %wide.load15 = load <4 x float>* %uglygep34, align 4
  %sunkaddr48 = ptrtoint i8* %19 to i64
  %sunkaddr49 = add i64 %sunkaddr48, %22
  %sunkaddr50 = add i64 %sunkaddr49, 16
  %sunkaddr51 = inttoptr i64 %sunkaddr50 to <4 x float>*
  %wide.load16 = load <4 x float>* %sunkaddr51, align 4
  %23 = fmul <4 x float> %wide.load, %wide.load15
  %24 = fmul <4 x float> %wide.load14, %wide.load16
  %25 = fadd <4 x float> %vec.phi, %23
  %26 = fadd <4 x float> %vec.phi13, %24
  %lsr.iv.next33 = add i64 %lsr.iv32, -8
  %27 = add i64 %n.vec, %lsr.iv.next33
  %28 = icmp eq i64 %27, 0
  br i1 %28, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x float> [ zeroinitializer, %if3 ], [ %25, %vector.body ]
  %rdx.vec.exit.phi19 = phi <4 x float> [ zeroinitializer, %if3 ], [ %26, %vector.body ]
  %bin.rdx = fadd <4 x float> %rdx.vec.exit.phi19, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx20 = fadd <4 x float> %bin.rdx, %rdx.shuf
  %rdx.shuf21 = shufflevector <4 x float> %bin.rdx20, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx22 = fadd <4 x float> %bin.rdx20, %rdx.shuf21
  %29 = extractelement <4 x float> %bin.rdx22, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %middle.block
  %30 = mul i64 %resume.val, 4
  %scevgep = getelementptr i8* %19, i64 %30

<SNIPPED>

DNF

Nov 6, 2015, 1:35:50 PM
to julia-users
Thanks for the feedback. It seems like this is not a problem for most.

If anyone has even the faintest clue where I could start looking for a solution to this, I would be grateful. Perhaps there is some software I could run that would detect hardware problems, or maybe I am missing software dependencies of some kind? What could I even google for? All my searches just seem to bring up general info about SIMD, nothing like what I'm describing.

Rob J. Goedman

Nov 6, 2015, 2:43:41 PM
to julia...@googlegroups.com
Hi DNF,

In the versioninfo outputs below, only libopenblas appears different. You installed using brew. The first thing I would try is to execute the steps under Common Issues listed on https://github.com/staticfloat/homebrew-julia. A bit further down on that site there is also some additional openblas-related info.

Rob


Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)

Seth

Nov 6, 2015, 6:01:52 PM
to julia-users
For what it's worth, I'm getting

julia> timeit(1000,1000)
GFlop        = 2.3913033081289967
GFlop (SIMD) = 2.2694726426420293


julia> versioninfo()
Julia Version 0.4.1-pre+22
Commit 669222e* (2015-11-01 00:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-svn

so it doesn't look like I'm taking advantage of simd either. :(

DNF

Nov 6, 2015, 6:14:25 PM
to julia-users
OK, wow! I tried following the advice in https://github.com/staticfloat/homebrew-julia, specifically:
$ brew rm gcc openblas-julia suite-sparse-julia arpack-julia
$ brew install gcc openblas-julia suite-sparse-julia arpack-julia

Now, there is a difference: the non-SIMD version is much slower than before:

GFlop        = 0.39122184254142406
GFlop (SIMD) = 1.7337076986157214


The SIMD version is basically the same speed as before, but according to code_llvm there is no vectorization.

I think I need to lie down.

Here are the code_llvm outputs:

julia> code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_innersimd_21416(%jl_value_t*, %jl_value_t*) {
L:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = load %jl_value_t** @jl_overflow_exception
  %11 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %9, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  %13 = xor i1 %12, true
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %L
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass:                                             ; preds = %L
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %14, i64 1)
  %16 = extractvalue { i64, i1 } %15, 1
  %17 = xor i1 %16, true
  br i1 %17, label %pass2, label %fail1

fail1:                                            ; preds = %pass
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %18 = extractvalue { i64, i1 } %15, 0
  %19 = icmp slt i64 0, %18
  %20 = xor i1 %19, true
  br i1 %20, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %pass2
  %sunkaddr = ptrtoint %jl_value_t* %0 to i64
  %sunkaddr19 = inttoptr i64 %sunkaddr to i8**
  %21 = load i8** %sunkaddr19
  %sunkaddr20 = ptrtoint %jl_value_t* %1 to i64
  %sunkaddr21 = inttoptr i64 %sunkaddr20 to i8**
  %22 = load i8** %sunkaddr21
  br label %L5

L5:                                               ; preds = %L5, %L5.preheader
  %lsr.iv16 = phi i8* [ %22, %L5.preheader ], [ %scevgep17, %L5 ]
  %lsr.iv = phi i8* [ %21, %L5.preheader ], [ %scevgep, %L5 ]
  %"##i#7098.0" = phi i64 [ %27, %L5 ], [ 0, %L5.preheader ]
  %s.1 = phi float [ %26, %L5 ], [ 0.000000e+00, %L5.preheader ]
  %lsr.iv1618 = bitcast i8* %lsr.iv16 to float*
  %lsr.iv15 = bitcast i8* %lsr.iv to float*
  %23 = load float* %lsr.iv15
  %24 = load float* %lsr.iv1618
  %25 = fmul float %23, %24
  %26 = fadd fast float %s.1, %25
  %27 = add i64 %"##i#7098.0", 1
  %scevgep = getelementptr i8* %lsr.iv, i64 4
  %scevgep17 = getelementptr i8* %lsr.iv16, i64 4
  %28 = icmp slt i64 %27, %18
  br i1 %28, label %L5, label %L11

L11:                                              ; preds = %L5, %pass2
  %s.3 = phi float [ 0.000000e+00, %pass2 ], [ %26, %L5 ]
  ret float %s.3
}

julia> code_llvm(inner, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_inner_21415(%jl_value_t*, %jl_value_t*) {
top:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = add i64 %9, 1
  %11 = icmp eq i64 1, %10
  br i1 %11, label %L3, label %L.preheader

L.preheader:                                      ; preds = %top
  %12 = bitcast %jl_value_t* %0 to %jl_array_t*
  %13 = bitcast %jl_array_t* %12 to i8**
  %14 = load i8** %13
  %15 = bitcast %jl_value_t* %1 to %jl_array_t*
  %16 = bitcast %jl_array_t* %15 to i8**
  %17 = load i8** %16
  %18 = add i64 %9, -1
  br label %L

L:                                                ; preds = %L, %L.preheader
  %lsr.iv6 = phi i8* [ %14, %L.preheader ], [ %scevgep7, %L ]
  %lsr.iv4 = phi i8* [ %17, %L.preheader ], [ %scevgep, %L ]
  %lsr.iv = phi i64 [ %18, %L.preheader ], [ %lsr.iv.next, %L ]
  %s.0 = phi float [ %22, %L ], [ 0.000000e+00, %L.preheader ]
  %lsr.iv68 = bitcast i8* %lsr.iv6 to float*
  %lsr.iv45 = bitcast i8* %lsr.iv4 to float*
  %19 = load float* %lsr.iv68
  %20 = load float* %lsr.iv45
  %21 = fmul float %19, %20
  %22 = fadd float %s.0, %21
  %23 = icmp eq i64 %lsr.iv, 0
  %24 = xor i1 %23, true
  %lsr.iv.next = add i64 %lsr.iv, -1
  %scevgep = getelementptr i8* %lsr.iv4, i64 4
  %scevgep7 = getelementptr i8* %lsr.iv6, i64 4
  br i1 %24, label %L, label %L3

L3:                                               ; preds = %L, %top
  %s.1 = phi float [ 0.000000e+00, %top ], [ %22, %L ]
  ret float %s.1
}



Rob J. Goedman

Nov 6, 2015, 6:54:04 PM
to julia...@googlegroups.com
Seth,

You must have built Julia 0.4.1-pre yourself. Did you use brew?

It looks like you are on Yosemite and picked up a newer libLLVM. Which Xcode are you using?
In the Julia.rb formula there is a test of ENV.compiler; could it be that clang is not being used?

Rob

Seth

Nov 6, 2015, 7:53:30 PM
to julia-users
Hi Rob,

I built it (and openblas) myself (via git clone) since I'm testing out Cxx.jl. Xcode is Version 7.1 (7B91b).

Seth.

Rob J. Goedman

Nov 6, 2015, 8:36:06 PM
to julia...@googlegroups.com
Thanks Seth,

That's the end of my first attempt to figure out what’s happening here. Back to the drawing board!

Regards,
Rob

Rob J. Goedman

Nov 8, 2015, 5:51:50 PM
to julia...@googlegroups.com
On another, slightly older system, I noticed similar (approximately identical) timings for the simd.jl test script using Julia 0.5:

julia> include("/Users/rob/Projects/Julia/Rob/Julia/simd.jl")

Julia Version 0.5.0-dev+720
Commit 5920633* (2015-10-11 15:15 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 0.6092165323090373
GFlop (SIMD) = 0.4607065672339039

Second call to timeit(1000,1000):

GFlop        = 0.5935117884795207
GFlop (SIMD) = 0.42286883095163036

On that same system Julia 0.4 (installed from the Julia site) did show improved Gflop numbers and about a 6x improvement with simd.

To see if that would help with Julia 0.5, I did (in the cloned julia directory, in a terminal):

make -j 4

Lots of compile messages/warnings, but in the end:

clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [libopenblas64_p-r0.2.15.dylib] Error 1
make[2]: *** [shared] Error 2
*** Clean the OpenBLAS build with 'make -C deps clean-openblas'. Rebuild with 'make OPENBLAS_USE_THREAD=0 if OpenBLAS had trouble linking libpthread.so, and with 'make OPENBLAS_TARGET_ARCH=NEHALEM' if there were errors building SandyBridge support. Both these options can also be used simultaneously. ***
make[1]: *** [build/openblas/libopenblas64_.dylib] Error 1
make: *** [julia-deps] Error 2

I tried:

brew update
brew upgrade
make -C deps clean-openblas
make -j 4

and running the simd.jl script now shows:

julia> include("/Users/rob/Projects/Julia/Rob/Julia/simd.jl")

Julia Version 0.5.0-dev+1195
Commit 68667a3* (2015-11-08 21:05 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 1.4006308441321973
GFlop (SIMD) = 13.561988458747821

Second call to timeit(1000,1000):

GFlop        = 2.300048186009497
GFlop (SIMD) = 12.84397777991844


Not sure if this helps or is even the right way to remedy this.

Regards,
Rob


simd.jl

DNF

Nov 9, 2015, 2:47:17 PM
to julia-users
Thank you very much for taking the time to look into this, Rob.

My understanding now is that this has something to do with the build process of Julia, or perhaps with which version of openblas is being used. Am I understanding that correctly? Do you think this is something I could raise with the maintainer of homebrew-julia?

I see that you are running some 'make' commands. Is 'make' something I run after git pulling directly from the main julia git repository, or does it somehow work with homebrew as well?

Rob J. Goedman

Nov 9, 2015, 3:57:12 PM
to julia...@googlegroups.com
Hi DNF,

Those instructions (if they help in all cases) only work if you build Julia yourself by cloning the Julia git repository. You have installed julia via homebrew.
Unfortunately that route doesn’t work for me:

```
rob$ /usr/local/bin/julia
Illegal instruction: 4
```

I suspect this problem may be due to older stuff left behind somewhere (/usr/local/Cellar?).

Just now I installed Julia 0.4.1 from julialang.org/downloads and then moved the Julia-0.4.1.app to /Applications (after double clicking the Julia disk). That gave me simd without any problems. Maybe you could try that route?

Regards,
Rob

———————————————————————————————————————————————————————————
<simd.jl>



On Nov 6, 2015, at 5:35 PM, Rob J. Goedman <goe...@icloud.com> wrote:

Thanks Seth,

That's the end of my first attempt to figure out what’s happening here. Back to the drawing board!

Regards,
Rob
On Nov 6, 2015, at 4:53 PM, Seth <catc...@bromberger.com> wrote:

Hi Rob,

I built it (and openblas) myself (via git clone) since I'm testing out Cxx.jl. Xcode is Version 7.1 (7B91b).

Seth.


On Friday, November 6, 2015 at 3:54:04 PM UTC-8, Rob J Goedman wrote:
Seth,

You must have built  Julia 0.4.1-pre yourself. Did you use brew?

It looks like you are on Yosemite and picked up a newer libLLVM. Which Xcode are you using?
In the Julia.rb formula there is a test ENV.compiler, could it be clang is not being used? 

Rob
On Nov 6, 2015, at 3:01 PM, Seth <catc...@bromberger.com> wrote:

For what it's worth, I'm getting

julia> timeit(1000,1000)
GFlop        = 2.3913033081289967
GFlop (SIMD) = 2.2694726426420293


julia> versioninfo()
Julia Version 0.4.1-pre+22
Commit 669222e* (2015-11-01 00:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-svn

so it doesn't look like I'm taking advantage of simd either. :(

On Friday, November 6, 2015 at 11:43:41 AM UTC-8, Rob J Goedman wrote:
Hi DNF,

In below versioninfo’s only libopenblas appears different. You installed using brew. The first thing I would try is to execute the steps under Common Issues listed on https://github.com/staticfloat/homebrew-julia. A bit further down on that site there is also some additional openblas related info.

Rob

On Nov 6, 2015, at 10:35 AM, DNF <oyv...@gmail.com> wrote:

Thanks for the feedback. It seems like this is not a problem for most.

If anyone has even the faintest clue where I could start looking for a solution to this, I would be grateful. Perhaps there is some software I could run that would detect hardware problems, or maybe I am missing software dependencies of some kind? What could I even google for? All my searches just seem to bring up general info about SIMD, nothing like what I'm describing.


On Friday, November 6, 2015 at 12:15:47 AM UTC+1, DNF wrote:
I install using homebrew from here: https://github.com/staticfloat/homebrew-julia

I have limited understanding of the process, but believe there is some compilation involved.


Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)

Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)

DNF

unread,
Nov 9, 2015, 4:14:24 PM11/9/15
to julia-users
Thanks a lot. That indeed works. The speedup is not particularly large and varies quite a bit, from about 1.5x to 3x. But it is working, and code_llvm reports a vector block.

Though performance isn't all that impressive, at least I know there is nothing fundamentally stopping the SIMD optimizations from happening.

DNF

unread,
Nov 9, 2015, 4:15:58 PM11/9/15
to julia-users
On Monday, November 9, 2015 at 10:14:24 PM UTC+1, DNF wrote:
Thanks a lot. That indeed works.

Oh, and by "that", I mean installing the Julia 0.4.1 app. 

Rob J. Goedman

unread,
Nov 9, 2015, 4:44:06 PM11/9/15
to julia...@googlegroups.com
Great! 

I just removed all of /usr/local/Cellar, did a full `brew install julia`, but it still fails.

Will file an issue.

Regards,
Rob
On Nov 9, 2015, at 12:56 PM, Rob J. Goedman <goe...@icloud.com> wrote:

Hi DNF,

Those instructions (if they help in all cases) only work if you build Julia yourself by cloning the Julia git repository. You have installed julia via homebrew.
Unfortunately that route doesn’t work for me:

```
rob$ /usr/local/bin/julia
Illegal instruction: 4
```

I expect that maybe this problem is due to older stuff left behind somewhere (/usr/local/Cellar?).

Just now I installed Julia 0.4.1 from julialang.org/downloads and then moved the Julia-0.4.1.app to /Applications (after double clicking the Julia disk). That gave me simd without any problems. Maybe you could try that route?

Regards,
Rob

Greg Plowman

unread,
Nov 18, 2015, 5:44:47 PM11/18/15
to julia-users
1. Does simd work for Integer types?

code_llvm shows vector.body section for Int16,Int32,Float32,Float64 but not for Int64. (on my Windows 64 machine, Julia v0.4.1)

Speedup is seen for Float32 and to lesser extent Float64, as expected.
Integers show no speed up. Is this because simd is applied implicitly?
Why is Int64 apparently not using simd?

buf = IOBuffer()
n = 10000
reps = 1000
for T in (Int16,Int32,Int64,Float32,Float64)
    code_llvm(buf, innersimd, Tuple{Vector{T},Vector{T}})
    println(T, " ", contains(takebuf_string(buf), "vector.body"))
    timeit(T, n, reps)
end

Int16 true
GFlop Int16        = 14.329049425190183
GFlop Int16 (SIMD) = 14.64120268695352

Int32 true
GFlop Int32        = 4.339303129613899
GFlop Int32 (SIMD) = 4.436321047681579

Int64 false
GFlop Int64        = 2.1942537759816103
GFlop Int64 (SIMD) = 2.195101499298226

Float32 true
GFlop Float32        = 2.1954870446504984
GFlop Float32 (SIMD) = 7.82465266366826

Float64 true
GFlop Float64        = 2.171535919755667
GFlop Float64 (SIMD) = 4.0068798126383
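(For reference, `innersimd` here is the dot-product loop from the manual's @simd example; `timeit` has been adapted to take the element type. A minimal sketch of the two helpers as assumed above; the exact definitions used for these timings may differ:)

```julia
function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    return s
end

function timeit(T, n, reps)
    x = rand(T, n)
    y = rand(T, n)
    s = zero(T)
    # 2 flops (multiply + add) per element per repetition
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop ", T, " (SIMD) = ", 2.0 * n * reps / time * 1e-9)
end
```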


2. Can simd be applied to unrolled statements? Perhaps using some form of meta expression?

begin
    Expr(:meta, :simd)
    s = 0
    s += x[1]*y[1]
    s += x[2]*y[2]
    s += x[3]*y[3]
    s += x[4]*y[4]
    s += x[5]*y[5]
    s += x[6]*y[6]
    s += x[7]*y[7]
    s += x[8]*y[8]
end



Yichao Yu

unread,
Nov 18, 2015, 6:05:41 PM11/18/15
to Julia Users
On Wed, Nov 18, 2015 at 5:44 PM, 'Greg Plowman' via julia-users
<julia...@googlegroups.com> wrote:
> 1. Does simd work for Integer types?
>
> code_llvm shows vector.body section for Int16,Int32,Float32,Float64 but not
> for Int64. (on my Windows 64 machine, Julia v0.4.1)
>
> Speedup is seen for Float32 and to lesser extent Float64, as expected.
> Integers show no speed up. Is this because simd is applied implicitly?
> Why is Int64 apparently not using simd?

Integers can be implicitly vectorized since their addition is
associative (floating point additions are not). Whether it will
actually be vectorized depends on the cost model. Your timing shows
that Int64 is ~2x slower than Int32 despite not being vectorized, so
LLVM is making the right decision here.

> 2. Can simd be applied to unrolled statements? Perhaps using some form of meta expression?

Not right now, but maybe [1]. I'm not sure how metadata can help here
though. Maybe `@fastmath` could relax some LLVM constraints and help
vectorize this case if the linked LLVM patch is merged.

[1] https://github.com/JuliaLang/julia/issues/11899#issuecomment-152604312
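As a sketch of that idea, `@fastmath` can already be wrapped around an unrolled reduction today; whether LLVM's SLP vectorizer actually combines the statements depends on its cost model. The function name `dot8` is made up for illustration:

```julia
# Hypothetical example: an unrolled 8-element dot product.
# @fastmath relaxes floating-point associativity, giving LLVM
# permission to reassociate the adds into vector operations.
function dot8(x::Vector{Float64}, y::Vector{Float64})
    @fastmath begin
        s  = x[1]*y[1]
        s += x[2]*y[2]
        s += x[3]*y[3]
        s += x[4]*y[4]
        s += x[5]*y[5]
        s += x[6]*y[6]
        s += x[7]*y[7]
        s += x[8]*y[8]
    end
    return s
end
```

Inspecting `code_llvm(dot8, Tuple{Vector{Float64},Vector{Float64}})` shows whether the adds were actually combined into vector instructions.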


Damien

unread,
Nov 19, 2015, 3:47:42 AM11/19/15
to julia-users
Try with:

    x = rand(Float32,n)::Array{Float32,1}
    y = rand(Float32,n)::Array{Float32,1}
    s = zero(Float64)::Float64
I believe this fixed a similar issue for me in Julia 0.4. The underlying problem must have been fixed in 0.5-dev.

@code_typed is also very useful in diagnosing failure to vectorize. Check for type instability, unexpected type promotion, overflow checks when converting number types, and non-inlined calls.
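A minimal sketch of that kind of check, using a simple reduction; the function name `sumloop` is made up for illustration:

```julia
function sumloop(x)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i]
    end
    return s
end

# Look for `Any`/`Union` types in the typed AST: they indicate
# type instability, which blocks vectorization.
code_typed(sumloop, Tuple{Vector{Float32}})

# Look for a `vector.body` block in the LLVM IR: its presence
# means the loop was successfully vectorized.
code_llvm(sumloop, Tuple{Vector{Float32}})
```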

I've been trying to make tests for this but they keep failing on the continuous integration machines:


On Thursday, 5 November 2015 15:12:22 UTC+1, DNF wrote:
I have been looking through the performance tips section of the manual. Specifically, I am curious about @simd (http://docs.julialang.org/en/release-0.4/manual/performance-tips/#performance-annotations).

When I cut and paste the code demonstrating the @simd macro, I don't get substantial speedups. Before updating from OSX Yosemite to El Capitan, I saw no speedup whatsoever. After the update, there is a small speedup (I ran the example repeatedly):

julia> timeit(1000,1000)
GFlop        = 1.2292170133468385
GFlop (SIMD) = 1.5351220575547964


This contrasts sharply to the example in the documentation which shows a speedup from 1.95GFlop to 17.6GFlop.

Does my computer not have simd? How can I tell?

This is my versioninfo: