Do I have simd?


DNF

Nov 5, 2015, 9:12:22 AM
to julia-users
I have been looking through the performance tips section of the manual. Specifically, I am curious about @simd (http://docs.julialang.org/en/release-0.4/manual/performance-tips/#performance-annotations).

When I cut and paste the code demonstrating the @simd macro, I don't get substantial speedups. Before updating from OSX Yosemite to El Capitan, I saw no speedup whatsoever. After the update, there is a small speedup (I ran the example repeatedly):

julia> timeit(1000,1000)
GFlop        = 1.2292170133468385
GFlop (SIMD) = 1.5351220575547964


This contrasts sharply with the example in the documentation, which shows a speedup from 1.95 GFlop to 17.6 GFlop.
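For anyone following along, the benchmark in question looks roughly like this (reconstructed from the linked 0.4 performance-tips page; if in doubt, use the manual's exact version):

```julia
function inner(x, y)
    s = zero(eltype(x))
    for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    s
end

function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)   # allow reassociation so the loop can vectorize
        @inbounds s += x[i] * y[i]
    end
    s
end

function timeit(n, reps)
    x = rand(Float32, n)
    y = rand(Float32, n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x, y)
    end
    println("GFlop        = ", 2n * reps / time * 1e-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop (SIMD) = ", 2n * reps / time * 1e-9)
end
```

Running `timeit(1000, 1000)` then produces the two GFlop lines shown above.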

Does my computer not have simd? How can I tell?

This is my versioninfo:

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Yichao Yu

Nov 5, 2015, 10:07:05 AM
to Julia Users
You can check with `code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})`.

DNF

Nov 5, 2015, 2:33:41 PM
to julia-users
On Thursday, November 5, 2015 at 4:07:05 PM UTC+1, Yichao Yu wrote:
You can check with `code_llvm(innersimd,
Tuple{Vector{Float32},Vector{Float32}})`

I tried it and got this output, but I don't know how to make sense of it:

julia> code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_innersimd_21674(%jl_value_t*, %jl_value_t*) {
L:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = load %jl_value_t** @jl_overflow_exception
  %11 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %9, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  %13 = xor i1 %12, true
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %L
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass:                                             ; preds = %L
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %14, i64 1)
  %16 = extractvalue { i64, i1 } %15, 1
  %17 = xor i1 %16, true
  br i1 %17, label %pass2, label %fail1

fail1:                                            ; preds = %pass
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %18 = extractvalue { i64, i1 } %15, 0
  %19 = icmp slt i64 0, %18
  %20 = xor i1 %19, true
  br i1 %20, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %pass2
  %sunkaddr = ptrtoint %jl_value_t* %0 to i64
  %sunkaddr19 = inttoptr i64 %sunkaddr to i8**
  %21 = load i8** %sunkaddr19
  %sunkaddr20 = ptrtoint %jl_value_t* %1 to i64
  %sunkaddr21 = inttoptr i64 %sunkaddr20 to i8**
  %22 = load i8** %sunkaddr21
  br label %L5

L5:                                               ; preds = %L5, %L5.preheader
  %lsr.iv16 = phi i8* [ %22, %L5.preheader ], [ %scevgep17, %L5 ]
  %lsr.iv = phi i8* [ %21, %L5.preheader ], [ %scevgep, %L5 ]
  %"##i#7153.0" = phi i64 [ %27, %L5 ], [ 0, %L5.preheader ]
  %s.1 = phi float [ %26, %L5 ], [ 0.000000e+00, %L5.preheader ]
  %lsr.iv1618 = bitcast i8* %lsr.iv16 to float*
  %lsr.iv15 = bitcast i8* %lsr.iv to float*
  %23 = load float* %lsr.iv15
  %24 = load float* %lsr.iv1618
  %25 = fmul float %23, %24
  %26 = fadd fast float %s.1, %25
  %27 = add i64 %"##i#7153.0", 1
  %scevgep = getelementptr i8* %lsr.iv, i64 4
  %scevgep17 = getelementptr i8* %lsr.iv16, i64 4
  %28 = icmp slt i64 %27, %18
  br i1 %28, label %L5, label %L11

L11:                                              ; preds = %L5, %pass2
  %s.3 = phi float [ 0.000000e+00, %pass2 ], [ %26, %L5 ]
  ret float %s.3
}

Kristoffer Carlsson

Nov 5, 2015, 4:14:32 PM
to julia-users
If it were compiled with SIMD instructions, it would have a vector body, which yours doesn't seem to have.
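Rather than eyeballing the whole dump, you can capture the IR as a string and search for the markers of a vectorized loop (`vector.body` and `<N x float>` types). This is a sketch; it redefines a toy `innersimd` so it is self-contained:

```julia
using InteractiveUtils   # needed on Julia >= 0.7; code_llvm is in Base on 0.4

# Toy reduction equivalent to the manual's innersimd, redefined here so the
# snippet stands alone.
function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    s
end

# Capture the IR that code_llvm would print, then search it.
ir = sprint(io -> code_llvm(io, innersimd, Tuple{Vector{Float32},Vector{Float32}}))
vectorized = contains(ir, "vector.body") || contains(ir, "x float>")
println(vectorized ? "looks vectorized" : "no vector body found")
```

Whether the second line prints "looks vectorized" depends on your build and CPU, which is exactly what this thread is trying to diagnose.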

DNF

Nov 5, 2015, 4:22:28 PM
to julia-users
I see. Do you know if I need to install something to get SIMD support?

According to this review of my computer model: "Haswell chips also include new instructions enhancing SIMD vector processing with Advanced Vector Extensions 2".

So what could be wrong?
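One thing worth ruling out first is the hardware itself. On OS X the kernel reports the CPU's SIMD feature flags via sysctl (the key names here are from memory and may vary slightly between OS versions; AVX shows up in the base feature list and AVX2 in the leaf-7 list on a Haswell chip):

```shell
# OS X: list any AVX-family flags the CPU reports
sysctl -n machdep.cpu.features machdep.cpu.leaf7_features 2>/dev/null | tr ' ' '\n' | grep -i avx || true

# Linux equivalent, for comparison
grep -i -o -m1 'avx2' /proc/cpuinfo || true
```

If AVX/AVX2 show up here, the CPU is fine and the problem is in the software stack.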

Benjamin Deonovic

Nov 5, 2015, 6:09:30 PM
to julia-users
Did you compile julia from source or just grab a pre-compiled binary?

DNF

Nov 5, 2015, 6:15:47 PM
to julia-users
I installed using Homebrew from here: https://github.com/staticfloat/homebrew-julia

I have limited understanding of the process, but believe there is some compilation involved.

Giuseppe Ragusa

Nov 6, 2015, 6:20:38 AM
to julia-users
I am pretty sure it must be something specific to your installation. On my machine

```
Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; RELEASE_X86_64 x86_64
```

running the code, I get the following timings:

```
julia> timeit(1000,1000)
GFlop        = 2.4503017546610866
GFlop (SIMD) = 11.622906423980382
```

DNF

Nov 6, 2015, 6:53:42 AM
to julia-users
On Friday, November 6, 2015 at 12:20:38 PM UTC+1, Giuseppe Ragusa wrote:
I am pretty sure it must be something specific to your installation.

Do you mean my Julia installation? 

Rob J. Goedman

Nov 6, 2015, 11:02:38 AM
to julia...@googlegroups.com
Hi DNF,

I get the results below on Julia 0.5 (home-built) and Julia 0.4 (downloaded).

A clear difference is the presence of a vector block in the output of `code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})`.

Regards,
Rob
Julia Version 0.5.0-dev+1158
Commit 20786d2* (2015-11-05 14:13 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 2.4158674171961443
GFlop (SIMD) = 14.63560990245366

Second call to timeit(1000,1000):

GFlop        = 2.3526062760477626
GFlop (SIMD) = 16.769379113738314


define float @julia_innersimd_23136(%jl_value_t*, %jl_value_t*) {
L:
  %2 = getelementptr inbounds %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64* %3, align 8
  %5 = icmp sgt i64 %4, 0
  %6 = select i1 %5, i64 %4, i64 0
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %6, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw(%jl_value_t* %9)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw(%jl_value_t* %13)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = bitcast %jl_value_t* %1 to i8**
  %17 = bitcast %jl_value_t* %0 to i8**
  %18 = load i8** %17, align 8
  %19 = load i8** %16, align 8
  %n.mod.vf = urem i64 %14, 24
  %cmp.zero = icmp eq i64 %14, %n.mod.vf
  br i1 %cmp.zero, label %middle.block, label %vector.ph

vector.ph:                                        ; preds = %if3
  %n.vec = sub i64 %14, %n.mod.vf
  %20 = sub i64 %n.mod.vf, %14
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %lsr.iv41 = phi i64 [ %lsr.iv.next42, %vector.body ], [ 0, %vector.ph ]
  %vec.phi = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %29, %vector.body ]
  %vec.phi12 = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %30, %vector.body ]
  %vec.phi13 = phi <8 x float> [ zeroinitializer, %vector.ph ], [ %31, %vector.body ]
  %21 = mul i64 %lsr.iv41, -4
  %uglygep60 = getelementptr i8* %18, i64 %21
  %uglygep6061 = bitcast i8* %uglygep60 to <8 x float>*
  %wide.load = load <8 x float>* %uglygep6061, align 4
  %22 = mul i64 %lsr.iv41, -4
  %sunkaddr = ptrtoint i8* %18 to i64
  %sunkaddr62 = add i64 %sunkaddr, %22
  %sunkaddr63 = add i64 %sunkaddr62, 32
  %sunkaddr64 = inttoptr i64 %sunkaddr63 to <8 x float>*
  %wide.load16 = load <8 x float>* %sunkaddr64, align 4
  %23 = mul i64 %lsr.iv41, -4
  %sunkaddr65 = ptrtoint i8* %18 to i64
  %sunkaddr66 = add i64 %sunkaddr65, %23
  %sunkaddr67 = add i64 %sunkaddr66, 64
  %sunkaddr68 = inttoptr i64 %sunkaddr67 to <8 x float>*
  %wide.load17 = load <8 x float>* %sunkaddr68, align 4
  %24 = mul i64 %lsr.iv41, -4
  %uglygep = getelementptr i8* %19, i64 %24
  %uglygep43 = bitcast i8* %uglygep to <8 x float>*
  %wide.load18 = load <8 x float>* %uglygep43, align 4
  %sunkaddr69 = ptrtoint i8* %19 to i64
  %sunkaddr70 = add i64 %sunkaddr69, %24
  %sunkaddr71 = add i64 %sunkaddr70, 32
  %sunkaddr72 = inttoptr i64 %sunkaddr71 to <8 x float>*
  %wide.load19 = load <8 x float>* %sunkaddr72, align 4
  %25 = mul i64 %lsr.iv41, -4
  %sunkaddr73 = ptrtoint i8* %19 to i64
  %sunkaddr74 = add i64 %sunkaddr73, %25
  %sunkaddr75 = add i64 %sunkaddr74, 64
  %sunkaddr76 = inttoptr i64 %sunkaddr75 to <8 x float>*
  %wide.load20 = load <8 x float>* %sunkaddr76, align 4
  %26 = fmul <8 x float> %wide.load, %wide.load18
  %27 = fmul <8 x float> %wide.load16, %wide.load19
  %28 = fmul <8 x float> %wide.load17, %wide.load20
  %29 = fadd <8 x float> %vec.phi, %26
  %30 = fadd <8 x float> %vec.phi12, %27
  %31 = fadd <8 x float> %vec.phi13, %28
  %lsr.iv.next42 = add i64 %lsr.iv41, -24
  %32 = icmp eq i64 %20, %lsr.iv.next42
  br i1 %32, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <8 x float> [ zeroinitializer, %if3 ], [ %29, %vector.body ]
  %rdx.vec.exit.phi23 = phi <8 x float> [ zeroinitializer, %if3 ], [ %30, %vector.body ]
  %rdx.vec.exit.phi24 = phi <8 x float> [ zeroinitializer, %if3 ], [ %31, %vector.body ]
  %bin.rdx = fadd <8 x float> %rdx.vec.exit.phi23, %rdx.vec.exit.phi
  %bin.rdx25 = fadd <8 x float> %rdx.vec.exit.phi24, %bin.rdx
  %rdx.shuf = shufflevector <8 x float> %bin.rdx25, <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx26 = fadd <8 x float> %bin.rdx25, %rdx.shuf
  %rdx.shuf27 = shufflevector <8 x float> %bin.rdx26, <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx28 = fadd <8 x float> %bin.rdx26, %rdx.shuf27
  %rdx.shuf29 = shufflevector <8 x float> %bin.rdx28, <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx30 = fadd <8 x float> %bin.rdx28, %rdx.shuf29
  %33 = extractelement <8 x float> %bin.rdx30, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %middle.block
  %34 = mul i64 %resume.val, 4
  %scevgep = getelementptr i8* %19, i64 %34
  %scevgep36 = getelementptr i8* %18, i64 %34

<SNIPPED>

-----------------------------------------------------------------------------------------------

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 2.552215131317849
GFlop (SIMD) = 13.911108019753776

Second call to timeit(1000,1000):

GFlop        = 2.553179538308544
GFlop (SIMD) = 14.156285390713476


define float @julia_innersimd_24595(%jl_value_t*, %jl_value_t*) {
L:
  %2 = getelementptr inbounds %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64* %3, align 8
  %5 = icmp sgt i64 %4, 0
  %6 = select i1 %5, i64 %4, i64 0
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %6, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = bitcast %jl_value_t* %1 to i8**
  %17 = bitcast %jl_value_t* %0 to i8**
  %18 = load i8** %17, align 8
  %19 = load i8** %16, align 8
  %n.vec = and i64 %14, -8
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body.preheader

vector.body.preheader:                            ; preds = %if3
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.body.preheader
  %lsr.iv32 = phi i64 [ 0, %vector.body.preheader ], [ %lsr.iv.next33, %vector.body ]
  %vec.phi = phi <4 x float> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %vec.phi13 = phi <4 x float> [ %26, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %20 = mul i64 %lsr.iv32, -4
  %uglygep43 = getelementptr i8* %18, i64 %20
  %uglygep4344 = bitcast i8* %uglygep43 to <4 x float>*
  %wide.load = load <4 x float>* %uglygep4344, align 4
  %21 = mul i64 %lsr.iv32, -4
  %sunkaddr = ptrtoint i8* %18 to i64
  %sunkaddr45 = add i64 %sunkaddr, %21
  %sunkaddr46 = add i64 %sunkaddr45, 16
  %sunkaddr47 = inttoptr i64 %sunkaddr46 to <4 x float>*
  %wide.load14 = load <4 x float>* %sunkaddr47, align 4
  %22 = mul i64 %lsr.iv32, -4
  %uglygep = getelementptr i8* %19, i64 %22
  %uglygep34 = bitcast i8* %uglygep to <4 x float>*
  %wide.load15 = load <4 x float>* %uglygep34, align 4
  %sunkaddr48 = ptrtoint i8* %19 to i64
  %sunkaddr49 = add i64 %sunkaddr48, %22
  %sunkaddr50 = add i64 %sunkaddr49, 16
  %sunkaddr51 = inttoptr i64 %sunkaddr50 to <4 x float>*
  %wide.load16 = load <4 x float>* %sunkaddr51, align 4
  %23 = fmul <4 x float> %wide.load, %wide.load15
  %24 = fmul <4 x float> %wide.load14, %wide.load16
  %25 = fadd <4 x float> %vec.phi, %23
  %26 = fadd <4 x float> %vec.phi13, %24
  %lsr.iv.next33 = add i64 %lsr.iv32, -8
  %27 = add i64 %n.vec, %lsr.iv.next33
  %28 = icmp eq i64 %27, 0
  br i1 %28, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <4 x float> [ zeroinitializer, %if3 ], [ %25, %vector.body ]
  %rdx.vec.exit.phi19 = phi <4 x float> [ zeroinitializer, %if3 ], [ %26, %vector.body ]
  %bin.rdx = fadd <4 x float> %rdx.vec.exit.phi19, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx20 = fadd <4 x float> %bin.rdx, %rdx.shuf
  %rdx.shuf21 = shufflevector <4 x float> %bin.rdx20, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx22 = fadd <4 x float> %bin.rdx20, %rdx.shuf21
  %29 = extractelement <4 x float> %bin.rdx22, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %middle.block
  %30 = mul i64 %resume.val, 4
  %scevgep = getelementptr i8* %19, i64 %30

<SNIPPED>

DNF

Nov 6, 2015, 1:35:50 PM
to julia-users
Thanks for the feedback. It seems like this is not a problem for most.

If anyone has even the faintest clue where I could start looking for a solution to this, I would be grateful. Perhaps there is some software I could run that would detect hardware problems, or maybe I am missing software dependencies of some kind? What could I even google for? All my searches just seem to bring up general info about SIMD, nothing like what I'm describing.

Rob J. Goedman

Nov 6, 2015, 2:43:41 PM
to julia...@googlegroups.com
Hi DNF,

In the versioninfo outputs below, only libopenblas appears different. You installed using brew. The first thing I would try is to execute the steps under Common Issues listed on https://github.com/staticfloat/homebrew-julia. A bit further down on that site there is also some additional openblas-related info.

Rob


Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)

Seth

Nov 6, 2015, 6:01:52 PM
to julia-users
For what it's worth, I'm getting

julia> timeit(1000,1000)
GFlop        = 2.3913033081289967
GFlop (SIMD) = 2.2694726426420293


julia> versioninfo()
Julia Version 0.4.1-pre+22
Commit 669222e* (2015-11-01 00:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-svn

so it doesn't look like I'm taking advantage of simd either. :(

DNF

Nov 6, 2015, 6:14:25 PM
to julia-users
OK, wow! I tried following the advice in https://github.com/staticfloat/homebrew-julia, specifically:
$ brew rm gcc openblas-julia suite-sparse-julia arpack-julia
$ brew install gcc openblas-julia suite-sparse-julia arpack-julia

Now, there is a difference: the non-SIMD version is much slower than before:

GFlop        = 0.39122184254142406
GFlop (SIMD) = 1.7337076986157214


The SIMD version is basically the same speed as before, but according to code_llvm there is no vectorization.

I think I need to lie down.

Here are the code_llvm outputs:

julia> code_llvm(innersimd, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_innersimd_21416(%jl_value_t*, %jl_value_t*) {
L:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = load %jl_value_t** @jl_overflow_exception
  %11 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %9, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  %13 = xor i1 %12, true
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %L
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass:                                             ; preds = %L
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %14, i64 1)
  %16 = extractvalue { i64, i1 } %15, 1
  %17 = xor i1 %16, true
  br i1 %17, label %pass2, label %fail1

fail1:                                            ; preds = %pass
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %10, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %18 = extractvalue { i64, i1 } %15, 0
  %19 = icmp slt i64 0, %18
  %20 = xor i1 %19, true
  br i1 %20, label %L11, label %L5.preheader

L5.preheader:                                     ; preds = %pass2
  %sunkaddr = ptrtoint %jl_value_t* %0 to i64
  %sunkaddr19 = inttoptr i64 %sunkaddr to i8**
  %21 = load i8** %sunkaddr19
  %sunkaddr20 = ptrtoint %jl_value_t* %1 to i64
  %sunkaddr21 = inttoptr i64 %sunkaddr20 to i8**
  %22 = load i8** %sunkaddr21
  br label %L5

L5:                                               ; preds = %L5, %L5.preheader
  %lsr.iv16 = phi i8* [ %22, %L5.preheader ], [ %scevgep17, %L5 ]
  %lsr.iv = phi i8* [ %21, %L5.preheader ], [ %scevgep, %L5 ]
  %"##i#7098.0" = phi i64 [ %27, %L5 ], [ 0, %L5.preheader ]
  %s.1 = phi float [ %26, %L5 ], [ 0.000000e+00, %L5.preheader ]
  %lsr.iv1618 = bitcast i8* %lsr.iv16 to float*
  %lsr.iv15 = bitcast i8* %lsr.iv to float*
  %23 = load float* %lsr.iv15
  %24 = load float* %lsr.iv1618
  %25 = fmul float %23, %24
  %26 = fadd fast float %s.1, %25
  %27 = add i64 %"##i#7098.0", 1
  %scevgep = getelementptr i8* %lsr.iv, i64 4
  %scevgep17 = getelementptr i8* %lsr.iv16, i64 4
  %28 = icmp slt i64 %27, %18
  br i1 %28, label %L5, label %L11

L11:                                              ; preds = %L5, %pass2
  %s.3 = phi float [ 0.000000e+00, %pass2 ], [ %26, %L5 ]
  ret float %s.3
}

julia> code_llvm(inner, Tuple{Vector{Float32},Vector{Float32}})

define float @julia_inner_21415(%jl_value_t*, %jl_value_t*) {
top:
  %2 = bitcast %jl_value_t* %0 to %jl_array_t*
  %3 = getelementptr inbounds %jl_array_t* %2, i32 0, i32 1
  %4 = load i64* %3
  %5 = icmp sle i64 1, %4
  %6 = xor i1 %5, true
  %7 = select i1 %6, i64 0, i64 %4
  %8 = insertvalue %UnitRange.1 { i64 1, i64 undef }, i64 %7, 1
  %9 = extractvalue %UnitRange.1 %8, 1
  %10 = add i64 %9, 1
  %11 = icmp eq i64 1, %10
  br i1 %11, label %L3, label %L.preheader

L.preheader:                                      ; preds = %top
  %12 = bitcast %jl_value_t* %0 to %jl_array_t*
  %13 = bitcast %jl_array_t* %12 to i8**
  %14 = load i8** %13
  %15 = bitcast %jl_value_t* %1 to %jl_array_t*
  %16 = bitcast %jl_array_t* %15 to i8**
  %17 = load i8** %16
  %18 = add i64 %9, -1
  br label %L

L:                                                ; preds = %L, %L.preheader
  %lsr.iv6 = phi i8* [ %14, %L.preheader ], [ %scevgep7, %L ]
  %lsr.iv4 = phi i8* [ %17, %L.preheader ], [ %scevgep, %L ]
  %lsr.iv = phi i64 [ %18, %L.preheader ], [ %lsr.iv.next, %L ]
  %s.0 = phi float [ %22, %L ], [ 0.000000e+00, %L.preheader ]
  %lsr.iv68 = bitcast i8* %lsr.iv6 to float*
  %lsr.iv45 = bitcast i8* %lsr.iv4 to float*
  %19 = load float* %lsr.iv68
  %20 = load float* %lsr.iv45
  %21 = fmul float %19, %20
  %22 = fadd float %s.0, %21
  %23 = icmp eq i64 %lsr.iv, 0
  %24 = xor i1 %23, true
  %lsr.iv.next = add i64 %lsr.iv, -1
  %scevgep = getelementptr i8* %lsr.iv4, i64 4
  %scevgep7 = getelementptr i8* %lsr.iv6, i64 4
  br i1 %24, label %L, label %L3

L3:                                               ; preds = %L, %top
  %s.1 = phi float [ 0.000000e+00, %top ], [ %22, %L ]
  ret float %s.1
}



Rob J. Goedman

Nov 6, 2015, 6:54:04 PM
to julia...@googlegroups.com
Seth,

You must have built Julia 0.4.1-pre yourself. Did you use brew?

It looks like you are on Yosemite and picked up a newer libLLVM. Which Xcode are you using?
In the Julia.rb formula there is a test of ENV.compiler; could it be that clang is not being used?

Rob

Seth

Nov 6, 2015, 7:53:30 PM
to julia-users
Hi Rob,

I built it (and openblas) myself (via git clone) since I'm testing out Cxx.jl. Xcode is Version 7.1 (7B91b).

Seth.

Rob J. Goedman

Nov 6, 2015, 8:36:06 PM
to julia...@googlegroups.com
Thanks Seth,

That's the end of my first attempt to figure out what’s happening here. Back to the drawing board!

Regards,
Rob

Rob J. Goedman

Nov 8, 2015, 5:51:50 PM
to julia...@googlegroups.com
On another, slightly older system, I noticed similar (approximately identical) timings for the simd.jl test script using Julia 0.5:

julia> include("/Users/rob/Projects/Julia/Rob/Julia/simd.jl")

Julia Version 0.5.0-dev+720
Commit 5920633* (2015-10-11 15:15 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 0.6092165323090373
GFlop (SIMD) = 0.4607065672339039

Second call to timeit(1000,1000):

GFlop        = 0.5935117884795207
GFlop (SIMD) = 0.42286883095163036

On that same system Julia 0.4 (installed from the Julia site) did show improved Gflop numbers and about a 6x improvement with simd.

To see if that would help with Julia 0.5, I did (in the cloned julia directory, in a terminal):

make -j 4

Lots of compile messages/warnings, but in the end:

clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [libopenblas64_p-r0.2.15.dylib] Error 1
make[2]: *** [shared] Error 2
*** Clean the OpenBLAS build with 'make -C deps clean-openblas'. Rebuild with 'make OPENBLAS_USE_THREAD=0 if OpenBLAS had trouble linking libpthread.so, and with 'make OPENBLAS_TARGET_ARCH=NEHALEM' if there were errors building SandyBridge support. Both these options can also be used simultaneously. ***
make[1]: *** [build/openblas/libopenblas64_.dylib] Error 1
make: *** [julia-deps] Error 2

I tried:

brew update
brew upgrade
make -C deps clean-openblas
make -j 4

and running the simd.jl script now shows:

julia> include("/Users/rob/Projects/Julia/Rob/Julia/simd.jl")

Julia Version 0.5.0-dev+1195
Commit 68667a3* (2015-11-08 21:05 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

First call to timeit(1000,1000):

GFlop        = 1.4006308441321973
GFlop (SIMD) = 13.561988458747821

Second call to timeit(1000,1000):

GFlop        = 2.300048186009497
GFlop (SIMD) = 12.84397777991844


Not sure if this helps or is even the right way to remedy this.

Regards,
Rob


simd.jl

DNF

Nov 9, 2015, 2:47:17 PM
to julia-users
Thank you very much for taking the time to look into this, Rob.

My understanding now is that this has something to do with the build process of Julia, or perhaps with which version of openblas is being used. Am I understanding that correctly? Do you think this is something I could raise with the maintainer of homebrew-julia?

I see that you are running some 'make' commands. Is 'make' something I run after git pulling directly from the main julia git repository, or does it somehow work with homebrew as well?

Rob J. Goedman

Nov 9, 2015, 3:57:12 PM
to julia...@googlegroups.com
Hi DNF,

Those instructions (if they help in all cases) only work if you build Julia yourself by cloning the Julia git repository. You have installed julia via homebrew.
Unfortunately that route doesn’t work for me:

```
rob$ /usr/local/bin/julia
Illegal instruction: 4
```

I suspect this problem may be due to older stuff left behind somewhere (/usr/local/Cellar?).

Just now I installed Julia 0.4.1 from julialang.org/downloads and then moved the Julia-0.4.1.app to /Applications (after double clicking the Julia disk). That gave me simd without any problems. Maybe you could try that route?

Regards,
Rob

———————————————————————————————————————————————————————————
<simd.jl>



On Nov 6, 2015, at 5:35 PM, Rob J. Goedman <goe...@icloud.com> wrote:

Thanks Seth,

That's the end of my first attempt to figure out what’s happening here. Back to the drawing board!

Regards,
Rob
On Nov 6, 2015, at 4:53 PM, Seth <catc...@bromberger.com> wrote:

Hi Rob,

I built it (and openblas) myself (via git clone) since I'm testing out Cxx.jl. Xcode is Version 7.1 (7B91b).

Seth.


On Friday, November 6, 2015 at 3:54:04 PM UTC-8, Rob J Goedman wrote:
Seth,

You must have built  Julia 0.4.1-pre yourself. Did you use brew?

It looks like you are on Yosemite and picked up a newer libLLVM. Which Xcode are you using?
In the Julia.rb formula there is a test ENV.compiler, could it be clang is not being used? 

Rob
On Nov 6, 2015, at 3:01 PM, Seth <catc...@bromberger.com> wrote:

For what it's worth, I'm getting

julia> timeit(1000,1000)
GFlop        = 2.3913033081289967
GFlop (SIMD) = 2.2694726426420293


julia> versioninfo()
Julia Version 0.4.1-pre+22
Commit 669222e* (2015-11-01 00:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-svn

so it doesn't look like I'm taking advantage of simd either. :(

On Friday, November 6, 2015 at 11:43:41 AM UTC-8, Rob J Goedman wrote:
Hi DNF,

In below versioninfo’s only libopenblas appears different. You installed using brew. The first thing I would try is to execute the steps under Common Issues listed on https://github.com/staticfloat/homebrew-julia. A bit further down on that site there is also some additional openblas related info.

Rob

On Nov 6, 2015, at 10:35 AM, DNF <oyv...@gmail.com> wrote:

Thanks for the feedback. It seems like this is not a problem for most.

If anyone has even the faintest clue where I could start looking for a solution to this, I would be grateful. Perhaps there is some software I could run that would detect hardware problems, or maybe I am missing software dependencies of some kind? What could I even google for? All my searches just seem to bring up general info about SIMD, nothing like what I'm describing.


On Friday, November 6, 2015 at 12:15:47 AM UTC+1, DNF wrote:
I install using homebrew from here: https://github.com/staticfloat/homebrew-julia

I have limited understanding of the process, but believe there is some compilation involved.


Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)

Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)

DNF

unread,
Nov 9, 2015, 4:14:24 PM11/9/15
to julia-users
Thanks a lot. That indeed works. The speedup is not particularly large and varies quite a bit, from about 1.5x to 3x. But it is working, and code_llvm reports a vector block.

Though performance isn't all that impressive, at least I know there is nothing fundamentally stopping the SIMD optimizations from happening.

DNF

unread,
Nov 9, 2015, 4:15:58 PM11/9/15
to julia-users
On Monday, November 9, 2015 at 10:14:24 PM UTC+1, DNF wrote:
Thanks a lot. That indeed works.

Oh, and by "that", I mean installing the Julia 0.4.1 app. 

Rob J. Goedman

unread,
Nov 9, 2015, 4:44:06 PM11/9/15
to julia...@googlegroups.com
Great! 

I just removed all of /usr/local/Cellar, did a full `brew install julia`, but it still fails.

Will file an issue.

Regards,
Rob
On Nov 9, 2015, at 12:56 PM, Rob J. Goedman <goe...@icloud.com> wrote:

Hi DNF,

Those instructions (if they help in all cases) only work if you build Julia yourself by cloning the Julia git repository. You have installed julia via homebrew.
Unfortunately that route doesn’t work for me:

```
rob$ /usr/local/bin/julia
Illegal instruction: 4
```

I expect that maybe this problem is due to older stuff left behind somewhere (/usr/local/Cellar?).

Just now I installed Julia 0.4.1 from julialang.org/downloads and then moved the Julia-0.4.1.app to /Applications (after double clicking the Julia disk). That gave me simd without any problems. Maybe you could try that route?

Regards,
Rob

Greg Plowman

unread,
Nov 18, 2015, 5:44:47 PM11/18/15
to julia-users
1. Does simd work for Integer types?

code_llvm shows vector.body section for Int16,Int32,Float32,Float64 but not for Int64. (on my Windows 64 machine, Julia v0.4.1)

Speedup is seen for Float32 and to lesser extent Float64, as expected.
Integers show no speed up. Is this because simd is applied implicitly?
Why is Int64 apparently not using simd?

buf = IOBuffer()
n = 10000
reps = 1000
for T in (Int16,Int32,Int64,Float32,Float64)
    code_llvm(buf, innersimd, Tuple{Vector{T},Vector{T}})
    println(T, " ", contains(takebuf_string(buf), "vector.body"))
    timeit(T, n, reps)
end

Int16 true
GFlop Int16        = 14.329049425190183
GFlop Int16 (SIMD) = 14.64120268695352

Int32 true
GFlop Int32        = 4.339303129613899
GFlop Int32 (SIMD) = 4.436321047681579

Int64 false
GFlop Int64        = 2.1942537759816103
GFlop Int64 (SIMD) = 2.195101499298226

Float32 true
GFlop Float32        = 2.1954870446504984
GFlop Float32 (SIMD) = 7.82465266366826

Float64 true
GFlop Float64        = 2.171535919755667
GFlop Float64 (SIMD) = 4.0068798126383
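(For reference, `innersimd` here is the dot-product loop from the manual's @simd example; `timeit` has been adapted to take the element type. A minimal sketch of the two helpers as assumed above; the exact definitions used for these timings may differ:)

```julia
function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    return s
end

function timeit(T, n, reps)
    x = rand(T, n)
    y = rand(T, n)
    s = zero(T)
    # 2 flops (multiply + add) per element per repetition
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop ", T, " (SIMD) = ", 2.0 * n * reps / time * 1e-9)
end
```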


2. Can simd be applied to unrolled statements? Perhaps using some form of meta expression?

begin
    Expr(:meta, :simd)
    s = 0
    s += x[1]*y[1]
    s += x[2]*y[2]
    s += x[3]*y[3]
    s += x[4]*y[4]
    s += x[5]*y[5]
    s += x[6]*y[6]
    s += x[7]*y[7]
    s += x[8]*y[8]
end



Yichao Yu

unread,
Nov 18, 2015, 6:05:41 PM11/18/15
to Julia Users
On Wed, Nov 18, 2015 at 5:44 PM, 'Greg Plowman' via julia-users
<julia...@googlegroups.com> wrote:
> 1. Does simd work for Integer types?
>
> code_llvm shows vector.body section for Int16,Int32,Float32,Float64 but not
> for Int64. (on my Windows 64 machine, Julia v0.4.1)
>
> Speedup is seen for Float32 and to lesser extent Float64, as expected.
> Integers show no speed up. Is this because simd is applied implicitly?
> Why is Int64 apparently not using simd?

Integers can be implicitly vectorized since their addition is
associative (floating point additions are not). Whether it will
actually be vectorized depends on the cost model. Your timing shows
that Int64 is ~2x slower than Int32 despite not being vectorized, so
LLVM is making the right decision here.

> 2. Can simd be applied to unrolled statements? Perhaps using some form of meta expression?

Not right now, but maybe [1]. I'm not sure how metadata can help here
though. Maybe `@fastmath` could relax some LLVM constraints and help
vectorize this case if the linked LLVM patch is merged.

[1] https://github.com/JuliaLang/julia/issues/11899#issuecomment-152604312
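As a sketch of that idea, `@fastmath` can already be wrapped around an unrolled reduction today; whether LLVM's SLP vectorizer actually combines the statements depends on its cost model. The function name `dot8` is made up for illustration:

```julia
# Hypothetical example: an unrolled 8-element dot product.
# @fastmath relaxes floating-point associativity, giving LLVM
# permission to reassociate the adds into vector operations.
function dot8(x::Vector{Float64}, y::Vector{Float64})
    @fastmath begin
        s  = x[1]*y[1]
        s += x[2]*y[2]
        s += x[3]*y[3]
        s += x[4]*y[4]
        s += x[5]*y[5]
        s += x[6]*y[6]
        s += x[7]*y[7]
        s += x[8]*y[8]
    end
    return s
end
```

Inspecting `code_llvm(dot8, Tuple{Vector{Float64},Vector{Float64}})` shows whether the adds were actually combined into vector instructions.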


Damien

unread,
Nov 19, 2015, 3:47:42 AM11/19/15
to julia-users
Try with:

    x = rand(Float32,n)::Array{Float32,1}
    y = rand(Float32,n)::Array{Float32,1}
    s = zero(Float64)::Float64
I believe this fixed a similar issue for me in Julia 0.4. The underlying problem must have been fixed in 0.5-dev.

@code_typed is also very useful in diagnosing failure to vectorize. Check for type instability, unexpected type promotion, overflow checks when converting number types, and non-inlined calls.
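A minimal sketch of that kind of check, using a simple reduction; the function name `sumloop` is made up for illustration:

```julia
function sumloop(x)
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i]
    end
    return s
end

# Look for `Any`/`Union` types in the typed AST: they indicate
# type instability, which blocks vectorization.
code_typed(sumloop, Tuple{Vector{Float32}})

# Look for a `vector.body` block in the LLVM IR: its presence
# means the loop was successfully vectorized.
code_llvm(sumloop, Tuple{Vector{Float32}})
```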

I've been trying to make tests for this but they keep failing on the continuous integration machines:


On Thursday, 5 November 2015 15:12:22 UTC+1, DNF wrote:
I have been looking through the performance tips section of the manual. Specifically, I am curious about @simd (http://docs.julialang.org/en/release-0.4/manual/performance-tips/#performance-annotations).

When I cut and paste the code demonstrating the @simd macro, I don't get substantial speedups. Before updating from OSX Yosemite to El Capitan, I saw no speedup whatsoever. After the update, there is a small speedup (I ran the example repeatedly):

julia> timeit(1000,1000)
GFlop        = 1.2292170133468385
GFlop (SIMD) = 1.5351220575547964


This contrasts sharply to the example in the documentation which shows a speedup from 1.95GFlop to 17.6GFlop.

Does my computer not have simd? How can I tell?

This is my versioninfo: