Why does my Julia code run so slow?


Zhixuan Yang

Feb 19, 2015, 9:51:20 AM
to julia...@googlegroups.com
Hello everyone, 

Recently I've been working on my first Julia project: a word embedding training program similar to Google's word2vec (the word2vec code is indeed very high quality, but I want to add more features, so I decided to write a new program). Thanks to Julia's expressiveness, it took me less than 2 days to write the entire program. But it runs really slowly, about 100x slower than the C code of word2vec (the algorithm is the same). I've been trying to optimize my code for several days (adding type annotations, using BLAS for computation, eliminating memory allocations, ...), but it is still 30x slower than the C code.

The critical part of my program is the following function (it also consumes most of the running time, according to the profiling results):

function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64; α :: Float64 = 0.025, input_gradient :: Union(Nothing, Array{Float64}) = nothing)
    predict!(c, x)
    c.outputs[y] -= 1

    if input_gradient != nothing
        # input_gradient = ( c.weights * outputs' )'
        BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
    end

    # c.weights -= α * x' * outputs;
    BLAS.ger!(-α, vec(x), c.outputs, c.weights)
end

function predict!(c :: LinearClassifier, x :: Array{Float64})
    c.outputs = vec(softmax(x * c.weights))
end

type LinearClassifier
    k :: Int64 # number of outputs
    n :: Int64 # number of inputs
    weights :: Array{Float64, 2} # n * k weight matrix (x * weights gives the k outputs)

    outputs :: Vector{Float64}
end
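
(For reference, `softmax` is not shown in the snippet; a minimal definition consistent with its use above, written in the 0.3-era vectorized style used throughout this thread, would be something like:)

function softmax(v)
    # Subtract the maximum before exponentiating for numerical stability.
    e = exp(v .- maximum(v))
    e ./ sum(e)
end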

The entire program can be found at https://github.com/yangzhixuan/embed. Could you please check my code and tell me what I can do to get performance comparable to C?

Regards.
Yang Zhixuan

Mauro

Feb 19, 2015, 11:16:16 AM
to julia...@googlegroups.com
Without having looked at your code too closely:

Keyword arguments don't work that well. If I recall correctly, their
type cannot be used inside the function body. How is the performance if
you don't use the keywords? (Although they should not impact performance
in calls that don't actually pass them.) A quick way to measure the
difference is sketched below.
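
(A minimal sketch of such a measurement; f_kw, f_pos, and the timing harness are illustrative toys, not code from the project:)

f_kw(x; a = 2.0)  = a * x      # keyword argument
f_pos(x, a = 2.0) = a * x      # optional positional argument

function time_both(n)
    s = 0.0
    @time for k in 1:n
        s += f_kw(1.0, a = 3.0)
    end
    @time for k in 1:n
        s += f_pos(1.0, 3.0)
    end
    return s
end

time_both(1)       # warm-up call so compilation isn't timed
time_both(10^7)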

Presumably you read:
http://docs.julialang.org/en/latest/manual/performance-tips/


Sean Marshallsay

Feb 19, 2015, 11:19:50 AM
to julia...@googlegroups.com
The two things that jump out at me straight away are keyword arguments and nullable fields. Changing those keyword arguments into optional positional arguments might give you a bit of a boost.

The thing that will really make a difference, though (I expect), is `input_gradient::Union(Nothing,Array{Float64})`: a Union-typed argument like that is a massive performance killer. You should probably use an empty array instead (i.e. `input_gradient::Array{Float64}=Float64[]`). If that's not possible, consider splitting the function into two methods, one that takes an input_gradient argument and one that doesn't; then you can do your `nothing` checking outside the function, roughly as sketched below.
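
(A hedged sketch of that two-method split, reusing the names from the snippet above; the bodies just mirror the original logic:)

# Method without a gradient request: no Union type, no `nothing` check.
function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64, α :: Float64 = 0.025)
    predict!(c, x)
    c.outputs[y] -= 1
    # c.weights -= α * x' * outputs
    BLAS.ger!(-α, vec(x), c.outputs, c.weights)
end

# Method that also accumulates the input gradient, now concretely typed.
function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64,
                   input_gradient :: Array{Float64}, α :: Float64 = 0.025)
    predict!(c, x)
    c.outputs[y] -= 1
    # input_gradient += α * c.weights * c.outputs
    BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
    BLAS.ger!(-α, vec(x), c.outputs, c.weights)
end

The caller now picks the right method, so the `nothing` check disappears entirely.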

Tim Holy

Feb 19, 2015, 12:08:49 PM
to julia...@googlegroups.com
In addition to the suggestions from the nice people who are taking the time to
respond to your question, see also
http://docs.julialang.org/en/release-0.3/manual/performance-tips/

--Tim


Zhixuan Yang

Feb 20, 2015, 9:49:50 AM
to julia...@googlegroups.com
Mauro, Sean, and Tim, thanks for your help.

Following your suggestions, I removed the keyword arguments and split the function to avoid conditional statements. These changes helped a bit.

But I got a surprising result after replacing the BLAS functions with simple for loops: the loops are about 1.5x faster than the BLAS calls. My Julia is compiled on my computer with the default configuration (the versioninfo() output is listed below). Do you think it would help to build Julia with a faster, more optimized BLAS implementation such as Intel's MKL?

Julia Version 0.3.6-pre+70
Commit 638fa02 (2015-02-12 13:59 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.1.0)
  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Regards, Yang Zhixuan


Tim Holy

Feb 20, 2015, 10:16:40 AM
to julia...@googlegroups.com
This definitely happens sometimes, especially with smaller matrices, where
the fixed overhead of a BLAS call can outweigh its benefits. OpenBLAS seems
to be tuned for larger matrices; it's really good on those, though.
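
(One way to see the crossover directly; a rough sketch with arbitrary sizes, assuming BLAS is in scope as in the earlier snippets:)

# Hand-rolled y += A*x, walking each column contiguously (column-major).
function loop_gemv!(y, A, x)
    m, n = size(A)
    for j in 1:n
        xj = x[j]
        @inbounds for i in 1:m
            y[i] += A[i, j] * xj
        end
    end
    return y
end

for n in (32, 2048)                        # small vs. large matrix
    A = rand(n, n); x = rand(n); y = zeros(n)
    loop_gemv!(y, A, x)                    # warm up / compile first
    @time for k in 1:1000
        loop_gemv!(y, A, x)
    end
    @time for k in 1:1000
        BLAS.gemv!('N', 1.0, A, x, 1.0, y) # same operation via BLAS
    end
end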

If you want to try MKL, see https://github.com/JuliaLang/julia#intel-compilers-and-math-kernel-libraries

--Tim


Viral Shah

Feb 21, 2015, 1:23:37 AM
to julia...@googlegroups.com
So, where is the performance now compared to the C program? I don't think MKL will give you much if you were on the order of 100x slower to start with.

-viral

Zhixuan Yang

Feb 21, 2015, 2:53:56 AM
to julia...@googlegroups.com
After recompiling a native-arch build of Julia and OpenBLAS, my program is about 8x slower than the C code, and I think that's close to the best performance my code can achieve. After all, the C code was intensively optimized at the cache level and all of its loops were unrolled, while my Julia code is much more flexible and extensible.

Maybe I should try using more computers. Currently my code is parallelized using pmap(). I hope communication overhead won't become a new bottleneck when I run it on a local network cluster.
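
(For reference, the overall pmap pattern looks roughly like this; train_chunk, split_corpus, and merge_models are hypothetical stand-ins for the project's functions:)

addprocs(4)                        # or machines from a cluster config
@everywhere include("train.jl")    # hypothetical file defining train_chunk

# Each worker trains on its own slice of the corpus; the partial models
# are then merged on the master process.
chunks  = split_corpus(corpus, nworkers())   # hypothetical helper
results = pmap(train_chunk, chunks)
model   = reduce(merge_models, results)      # hypothetical merge step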

Thanks for your help! 

Regards, Yang Zhixuan


Luke Stagner

Feb 21, 2015, 3:12:40 AM
to julia...@googlegroups.com
You may want to try using a profiler. I recently used the ProfileView.jl package to great success. 
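
(Typical usage, for anyone following along; here c, x, y, and grad stand for already-constructed training arguments:)

Pkg.add("ProfileView")         # once; @profile itself ships with Julia
using ProfileView              # (on newer Julia, also `using Profile`)

train_one(c, x, y, grad)       # run once so compilation isn't profiled
@profile for k in 1:10000
    train_one(c, x, y, grad)
end
ProfileView.view()             # flame-graph view of where the time goes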

Viral Shah

Feb 21, 2015, 3:37:37 AM
to julia...@googlegroups.com
This does look like a nice benchmark. I would love to see what it takes to narrow the gap further. Playing around with it now. Perhaps the threads branch is also worth a shot.

-viral

Mauro

Feb 21, 2015, 7:18:39 AM
to julia...@googlegroups.com
> After all, the C code was optimized intensively in the cache level and
> all loops were unrolled.

Julia is good at unrolling loops using macros.
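
(For instance, @nexprs from Base.Cartesian unrolls an expression at parse time; a small illustrative sketch, with made-up names:)

using Base.Cartesian

# @nexprs 4 (d -> expr) pastes expr four times at parse time with
# d = 1, 2, 3, 4, i.e. a manually unrolled body without copy-and-paste.
function add4!(y, m, A, j, i)
    @nexprs 4 (d -> y[j + d - 1] += m * A[j + d - 1, i])
    return y
end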

Tim Holy

Feb 21, 2015, 9:49:57 AM
to julia...@googlegroups.com
Just to check, in writing out your own version of gemv! you're using @inbounds
@simd, right?

The @nexprs macro (documented in the Base.Cartesian section of the manual)
lets you unroll loops manually. Also, see the (currently alpha) KernelTools.jl
repository for some ideas about improving cache efficiency; perhaps the @tile
macro will help.
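
(For concreteness, one possible shape of such a hand-written kernel; this is a sketch, not the poster's actual code:)

# grad += α * weights * outputs, written as explicit loops.
function my_gemv!(grad, α, weights, outputs)
    n, k = size(weights)
    for i in 1:k
        m = α * outputs[i]
        # The innermost loop walks down one column: contiguous memory,
        # no bounds checks, and a candidate for SIMD vectorization.
        @inbounds @simd for j in 1:n
            grad[j] += m * weights[j, i]
        end
    end
    return grad
end

With --check-bounds=no the @inbounds is redundant, but keeping it preserves the speed when the flag isn't passed.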

--Tim

Zhixuan Yang

Feb 21, 2015, 10:11:24 AM
to julia...@googlegroups.com
I passed "--check-bounds=no" to julia when launching the REPL, to avoid writing @inbounds explicitly.

Simply adding @simd before the for loops doesn't seem to help; it actually slows the code down slightly.

@nexprs is exactly what I need to avoid redundant code when unrolling loops. I will use it to simplify the code.

I will look at KernelTools.jl later.

At this point my Julia code runs roughly 3.5x slower than the C code, and I'm pleased with that. I'd be glad to move on to adding the features I wanted in the first place. :-)

Thanks, Yang Zhixuan


michae...@gmail.com

Feb 21, 2015, 11:39:53 AM
to julia...@googlegroups.com
What's the type of c.outputs? In train_one it seems to be Int64; in predict! it seems to be Float64.

Viral Shah

Feb 21, 2015, 12:30:10 PM
to julia...@googlegroups.com
It is also worth trying one of the 0.4-dev nightlies and comparing the performance. The code avoids creating temporaries to a large extent, but it may be worth checking whether the new GC helps.

-viral

Zhixuan Yang

Feb 23, 2015, 2:15:04 AM
to julia...@googlegroups.com
The type of c.outputs is Vector{Float64}.


Zhixuan Yang

Feb 23, 2015, 2:21:29 AM
to julia...@googlegroups.com
In the current version of my function, not a single temporary array is created, and the portion of time spent in GC is less than 5% (as reported by the @time macro). So I don't think the new GC algorithm will help much.


Zhixuan Yang

Feb 23, 2015, 2:24:03 AM
to julia...@googlegroups.com
The following code is my current version of train_one():

function train_one(c :: LinearClassifier, x :: Array{Float64}, y :: Int64, input_gradient :: Array{Float64}, α :: Float64 = 0.025)
    predict!(c, x)
    c.outputs[y] -= 1

    # input_gradient = ( c.weights * outputs' )'
    # BLAS.gemv!('N', α, c.weights, c.outputs, 1.0, input_gradient)
    m = 0.0
    j = 0
    limit = c.n - 4
    for i in 1:c.k
        m = α * c.outputs[i]
        j = 1
        while j <= limit
            @nexprs 4 (idx -> input_gradient[j + idx - 1] += m * c.weights[j + idx - 1, i])
            j += 4
        end
        while j <= c.n
            input_gradient[j] += m * c.weights[j, i]
            j += 1
        end
    end

    # c.weights -= α * x' * outputs;
    # BLAS.ger!(-α, vec(x), c.outputs, c.weights)
    for i in 1:c.k
        m = α * c.outputs[i]
        j = 1
        while j <= limit
            c.weights[j, i]     -= m * x[j]
            c.weights[j + 1, i] -= m * x[j + 1]
            c.weights[j + 2, i] -= m * x[j + 2]
            c.weights[j + 3, i] -= m * x[j + 3]
            j += 4
        end
        while j <= c.n
            c.weights[j, i] -= m * x[j]
            j += 1
        end
    end
end


Kuba Roth

Feb 23, 2015, 2:27:55 AM
to julia...@googlegroups.com
Sorry for an off-topic question.
@Viral: is there any discussion going on about the new GC you mentioned? Can you point me to it, please? I wonder what the changes are.
Thanks.

Simon Danisch

Feb 23, 2015, 3:01:16 AM
to julia...@googlegroups.com
How does it compare to C by now?

Zhixuan Yang

Feb 23, 2015, 3:03:31 AM
to julia...@googlegroups.com
About 3.5x slower.


Sean Marshallsay

Feb 23, 2015, 6:03:04 AM
to julia...@googlegroups.com
Kuba, most of the relevant discussion was happening here:
