# Perplexing performance problem


### Matthew Clegg

Sep 6, 2012, 10:46:01 PM
Greetings Julia Gurus:

I've just started using Julia.  Nice language.  I'm really impressed with the thought that has been put into it.

I am having a problem, and I was hoping someone could offer some guidance.  I can't figure out why my Julia code
is not running faster.  I am probably overlooking something simple, but I don't know what.

Here is my code:

```
% cat llst.jl
# log likelihood of a student's t distribution with arbitrary location and scale
function llst(A::Array{Float64}, mu::Float64, sigma::Float64, df::Float64)
    n = numel(A)
    c = n * (lgamma((df+1.)/2.) - 0.5 * log(df * pi) - lgamma(df/2.) - log(sigma))
    # No difference found in execution time between reduce(f...) and sum(map(f...))
    # s = -((df + 1.) / 2.) * reduce((ll, x) -> ll+log1p(((x - mu)/sigma)^2/df), 0., A)
    s = -((df + 1.) / 2.) * sum(map(x -> log1p(((x - mu)/sigma)^2/df), A))
    s + c
end

const vec = [1:1000]*.002
println("llst(vec,1.,2.,3.) = ", llst(vec,1.,2.,3.))
@time for i = 1:1000 llst(vec,1.,2.,3.) end
```

---

When I run this code on my computer, which is an ancient Mac Pro with 2.66 GHz dual-core
Xeon processors running Mac OS X 10.7.4, I get the following:

```
% julia
               _
   _       _ _(_)_     |
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  A fresh approach to technical computing
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.0.0+95599289.r17c9
 _/ |\__'_|_|_|\__'_|  |  Commit 17c9d27eaa (2012-09-02 11:00:35)
|__/                   |

llst(vec,1.,2.,3.) = -1748.2553729185386
elapsed time: 0.5310850143432617 seconds
```

---

For comparison, here is the original R code:

```
llst <- function(X, mu, sigma, df) {
  # Computes the log likelihood that (X - mu)/sigma is distributed
  # according to a t-distribution with df degrees of freedom.
  N <- length(X);
  C <- N * (lgamma((df+1)/2) - 0.5*log(df * pi) - lgamma(df/2) - log(sigma));
  S <- -((df + 1)/2) * sum(log(1 + ((X-mu)/sigma)^2/df));
  S+C
}

vec <- (1:1000)*0.002
cat(sprintf("llst(vec,1,2,3) = %f\n", llst(vec,1,2,3)))
print(system.time(for (i in 1:1000) llst(vec,1,2,3)))
```

---

When I run the R code, this is the output that I receive:

```
% R

R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("llst.R")
llst(vec,1,2,3) = -1748.255373
   user  system elapsed
  0.086   0.011   0.096
```

---

As you can see from the above, the Julia code runs in 0.531
elapsed seconds, whereas the equivalent R code runs in 0.096
elapsed seconds.  Why is that?

To try to get a better handle on this, I constructed the following
simplified example.  The simplified code computes a random
vector of length n, and then sums it n times (for a total of n^2
additions).  This is done for n = 256, 512, 1024, ..., 65536.  See
below.

```
% cat simple.jl
function simple(n::Int64)
    vec = [rand() for i=1:n]
    s = [sum(vec) for i=1:n]
end
for i=8:16 @time simple(2^i) end
```

```
% julia
elapsed time: 0.015284061431884766 seconds
elapsed time: 0.0011169910430908203 seconds
elapsed time: 0.004406929016113281 seconds
elapsed time: 0.0424809455871582 seconds
elapsed time: 0.06986188888549805 seconds
elapsed time: 0.2790408134460449 seconds
elapsed time: 1.1160101890563965 seconds
elapsed time: 4.4562668800354 seconds
elapsed time: 17.841228008270264 seconds
```

---

Now, here is the same code in R:

```
% cat simple.R
simple <- function(n) {
  vec <- as.double(runif(n))
  s <- sapply(1:n, function(k) sum(vec))
}
for (i in 8:16) print(system.time(simple(2^i)))
```

```
% R
> source("simple.R")
   user  system elapsed
  0.001   0.000   0.001
   user  system elapsed
  0.002   0.000   0.002
   user  system elapsed
  0.004   0.000   0.004
   user  system elapsed
  0.012   0.000   0.013
   user  system elapsed
  0.042   0.000   0.042
   user  system elapsed
  0.151   0.000   0.150
   user  system elapsed
  0.562   0.002   0.565
   user  system elapsed
  2.165   0.002   2.169
   user  system elapsed
  8.500   0.006   8.507
```

---

In this example, the advantage of R over Julia has decreased, but it is
still winning by a factor of 2.

Any ideas about what may explain this?  More importantly, what can I do about it?

Matthew Clegg

### Tim Holy

Sep 7, 2012, 5:50:10 AM
Hi Matthew,

Welcome to Julia!

If you change the sum(map(...)) line to the vectorized form

```
s = -((df + 1.) / 2.) * sum(log1p(((A-mu)/sigma).^2/df))
```

I think you'll be a bit happier (I get a 4x performance boost).

For your second example, I noticed that the time includes random number
generation. I can't speak to whether R and Julia use the same algorithm, but
that seems a bit unlikely. Because different rngs differ in their quality and
speed, and there is sometimes a tradeoff, I would generally not include rng
timing in performance evaluations (unless that was the topic I was focusing
on).

Best,
--Tim

### Tom Short

Sep 7, 2012, 10:09:48 AM
On Fri, Sep 7, 2012 at 5:50 AM, Tim Holy <tim....@gmail.com> wrote:

> s = -((df + 1.) / 2.) * sum(log1p(((A-mu)/sigma).^2/df))

When I run Tim's version, it's a little faster than the R version (by
about 7%) if I call gc() before timing both. If you unroll the loop, Julia
is more than twice as fast as the R version. Here's the unrolled
version:

```
# log likelihood of a student's t distribution with arbitrary location and scale
function llst3(A::Array{Float64}, mu::Float64, sigma::Float64, df::Float64)
    n = numel(A)
    c = n * (lgamma((df+1.)/2.) - 0.5 * log(df * pi) - lgamma(df/2.) - log(sigma))
    z = 0.0
    for i in 1:n
        z += log1p(((A[i]-mu)/sigma)^2/df)
    end
    s = -((df + 1.) / 2.) * z
    s + c
end
println("llst3(vec,1.,2.,3.) = ", llst3(vec,1.,2.,3.))
gc(); @time for i = 1:1000 llst3(vec,1.,2.,3.) end
```

### Jeff Bezanson

Sep 7, 2012, 11:22:15 AM
Yes, unfortunately map() is still slow. This will be addressed.

Our sum() may also be slower because we use a more accurate algorithm.
In fact we're currently debating whether to keep it or sell out for
performance like everybody else.
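To make the tradeoff concrete, here is a small illustrative sketch in Python (not the Julia implementation; `naive_sum` and `kahan_sum` are names I made up) contrasting plain left-to-right accumulation with Kahan compensated summation, the kind of "more accurate algorithm" being weighed against raw speed:

```python
import math

def naive_sum(xs):
    # plain left-to-right accumulation
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    # compensated summation: carry along the low-order bits lost at each step
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # the part of y that didn't make it into s
        s = t
    return s

# one large value followed by a million tiny ones: the true sum is 1.0 + 1e-10
xs = [1.0] + [1e-16] * 1_000_000
print(naive_sum(xs))  # exactly 1.0: every tiny addend is rounded away
print(kahan_sum(xs))  # recovers the tiny contributions
print(math.fsum(xs))  # Python's exact reference summation
```

The compensated loop does roughly four floating-point operations per element instead of one, which is the performance cost under debate.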

### Stefan Karpinski

Sep 7, 2012, 11:30:03 AM
On Fri, Sep 7, 2012 at 11:22 AM, Jeff Bezanson wrote:

> Our sum() may also be slower because we use a more accurate algorithm.
> In fact we're currently debating whether to keep it or sell out for
> performance like everybody else.

For those not following every development conversation:

It's been a rather long-running conversation: summation is pretty fundamental, and we don't want to get it "wrong" (although sometimes getting it right does not mean giving the most "correct" answer).

### Matthew Clegg

Sep 10, 2012, 11:08:33 AM

I thought I would summarize the results that I obtained when trying the various approaches.  As you recall, the original code was as follows:

```
# log likelihood of a student's t distribution with arbitrary location and scale
function llst(A::Array{Float64}, mu::Float64, sigma::Float64, df::Float64)
    n = numel(A)
    c = n * (lgamma((df+1.)/2.) - 0.5 * log(df * pi) - lgamma(df/2.) - log(sigma))
    s = -((df + 1.) / 2.) * sum(map(x -> log1p(((x - mu)/sigma)^2/df), A))  # **
    s + c
end
```

Let's call this the "map" version.  Any hope for a performance gain will require changing the line that I have marked with "# **".  The following alternatives were considered:

"reduce" version:
s = -((df + 1.) / 2.) * reduce((ll, x) -> ll+log1p(((x - mu)/sigma)^2/df), 0., A)

"vectorized" version (suggested by Tim):
s = -((df + 1.) / 2.) * sum(log1p(((A - mu)/sigma).^2/df))

"list comprehension" version:
s = -((df + 1.) / 2.) * sum([log1p(((A[i] - mu)/sigma)^2/df) for i=1:length(A)])

"loop unrolling" version (suggested by Tom Short):
z = 0.0
for i in 1:n
z += log1p(((A[i] - mu)/sigma)^2/df)
end
s = -((df + 1.) / 2.) * z

Here are the execution times of each of these implementations on my machine (recall, I previously measured the execution time of the equivalent R code at 0.096 seconds):

```
map                  elapsed time: 0.5343310832977295 seconds
reduce               elapsed time: 0.48698997497558594 seconds
vectorized           elapsed time: 0.08964395523071289 seconds
list comprehension   elapsed time: 0.061167001724243164 seconds
loop unrolling       elapsed time: 0.056635141372680664 seconds
```

Thanks to Tom Short's suggestion, I was able to achieve a significant improvement over my initial implementation.

It would certainly be nice if map and reduce could be sped up, as these are two of my favorite idioms.

It was also very interesting to read the discussion about the implementation of sum().  I would vote for making the default implementation fast, while giving the user the option of choosing the slower but more precise implementation.

My own situation is that I have been doing a lot of R programming for the last several years.  Two of my chief complaints about R are that the language was poorly designed and it is very inefficient.  The design of Julia does a great job of addressing many of my complaints about R's language design, but this by itself is not enough to justify the daunting task of porting all of my code to Julia, as well as having to re-invent a few R packages along the way.  I would need to see a substantial performance improvement as well.  I would not be surprised if a lot of other R programmers feel the same way.  So, in my view, a (zealous) focus on performance by the Julia team would not be misplaced.

Please do not interpret the above as a criticism.  I think you have done a great job, and I hope you will keep up the good work.

Thanks again for the help with my code.

I have attached the code that was used to generate the above performance benchmarks.

llst.jl

### Viral Shah

Sep 10, 2012, 1:15:55 PM
Thanks for the nice summary. This progression will make a nice addition to our performance benchmarks.

We also hope that effortless parallelism will help make Julia more compelling as we go ahead, even as the compiler gets better and library functions become faster over time.

-viral

### Douglas Bates

Sep 11, 2012, 11:18:30 AM
Is there a reason that you want to evaluate the log-likelihood from the definition instead of using a function like dt in R?  I would have written the log-likelihood function as

```
lls1 <- function(X, mu, sigma, df) sum(dt((X-mu)/sigma, df, log=TRUE))
```

but some experimentation shows that it gives a different answer than your code and also takes longer, which I find surprising.

In Julia I would have written

```
function llst(X, mu::Real, sigma::Real, df::Real)
    dd = TDist(df)
    ss = 0.
    for x in X ss += logpdf(dd, (x-mu)/sigma) end
    ss
end
```

which, on my machine, is roughly equivalent to the R version in timing and produces the same answer, -1055.1081923585919.

This equivalence is not surprising because the logpdf function for a T distribution uses the same code as does R for the low-level evaluation.

It is not easy to see how the calculation is actually being done in julia/deps/Rmath/src/dt.c because that code is a twisty maze of multi-line C macros.  One thing you can notice is that it uses different code according to whether ((x-mu)/sigma)^2 > 0.2 * df.

### Matthew Clegg

Sep 11, 2012, 3:58:02 PM

On Tuesday, September 11, 2012 11:18:30 AM UTC-4, Douglas Bates wrote:
> Is there a reason that you want to evaluate the log-likelihood from the definition instead of using a function like dt in R?  I would have written the log-likelihood function as
>
>     lls1 <- function(X, mu, sigma, df) sum(dt((X-mu)/sigma, df, log=TRUE))
>
> but some experimentation shows that it gives a different answer than your code and also takes longer, which I find surprising.

Please correct me if I am wrong, but I think you may have left out a 1/sigma factor.  If p(x) is a probability density and y = f(x), then shouldn't the probability density of x be p(f(x)) f'(x)?  In our case, f'(x) = 1/sigma.  So I believe the proper function definition should be:

```
lls2 <- function(X, mu, sigma, df) sum(dt((X-mu)/sigma, df, log=TRUE)) - length(X)*log(sigma)
```
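In symbols, this is the usual change-of-variables Jacobian: if T has density f_T and X = mu + sigma*T, then

```latex
p_X(x) = \frac{1}{\sigma}\, f_T\!\left(\frac{x-\mu}{\sigma}\right),
\qquad
\ell(\mu,\sigma,\nu) = \sum_{i=1}^{n} \log f_T\!\left(\frac{x_i-\mu}{\sigma}\right) - n \log\sigma,
```

which is exactly the extra `- length(X)*log(sigma)` term above.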

After making this change, you'll find that the functions agree:

```
> llst(vec, 1, 2, 3)
[1] -1748.255
> lls1(vec, 1, 2, 3)
[1] -1055.108
> lls2(vec, 1, 2, 3)
[1] -1748.255
```

You can check that this is the right density function (or at least reasonable) by integrating:

```
> dst <- function(X, mu, sigma, df) dt((X-mu)/sigma, df)/sigma
> integrate(dst, -20, 20, 1, 2, 3)
0.9978412 with absolute error < 5.7e-07
```
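For readers without R at hand, the same sanity check can be sketched from scratch with a simple trapezoidal rule; this Python version (the helper names are mine) rebuilds the scaled t density from lgamma and integrates it over [-20, 20]:

```python
import math

def t_logpdf(t, df):
    # log density of the standard t distribution, built from lgamma
    c = math.lgamma((df + 1) / 2) - 0.5 * math.log(df * math.pi) - math.lgamma(df / 2)
    return c - (df + 1) / 2 * math.log1p(t * t / df)

def dst(x, mu, sigma, df):
    # density of mu + sigma*T with T ~ t_df; note the 1/sigma Jacobian
    return math.exp(t_logpdf((x - mu) / sigma, df)) / sigma

# trapezoidal rule on [-20, 20], mirroring R's integrate(dst, -20, 20, 1, 2, 3)
a, b, n = -20.0, 20.0, 20000
h = (b - a) / n
total = 0.5 * (dst(a, 1, 2, 3) + dst(b, 1, 2, 3))
total += sum(dst(a + i * h, 1, 2, 3) for i in range(1, n))
total *= h
print(round(total, 7))  # close to R's reported 0.9978412
```

The mass missing from 1.0 is just the t3 tail beyond the truncated integration range.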

> In Julia I would have written
>
>     function llst(X, mu::Real, sigma::Real, df::Real)
>         dd = TDist(df)
>         ss = 0.
>         for x in X ss += logpdf(dd, (x-mu)/sigma) end
>         ss
>     end
>
> which, on my machine, is roughly equivalent to the R version in timing and produces the same answer, -1055.1081923585919.
>
> This equivalence is not surprising because the logpdf function for a T distribution uses the same code as does R for the low-level evaluation.
>
> It is not easy to see how the calculation is actually being done in julia/deps/Rmath/src/dt.c because that code is a twisty maze of multi-line C macros.  One thing you can notice is that it uses different code according to whether ((x-mu)/sigma)^2 > 0.2 * df.

Why is my code faster?  Certainly in Julia, one part of the answer is that I am computing the normalization constant only once, whereas repeatedly calling logpdf causes it to be recomputed for each element of X, and this is the major cost of evaluating the t density.  (This is another advertisement for the benefits of memoization, although the memoization would need to occur deep in the internals of the code.)
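The effect of hoisting the constant is easy to demonstrate; here is an illustrative Python sketch (the function names are mine, mirroring the formulas in this thread rather than any library API) comparing the hoisted-constant strategy with a per-element logpdf:

```python
import math

def t_log_norm(df):
    # df-dependent normalization constant of the t density (computable once)
    return math.lgamma((df + 1) / 2) - 0.5 * math.log(df * math.pi) - math.lgamma(df / 2)

def llst_hoisted(xs, mu, sigma, df):
    # normalization constant hoisted out of the loop, as in the Julia llst() above
    c = len(xs) * (t_log_norm(df) - math.log(sigma))
    s = -((df + 1) / 2) * sum(math.log1p(((x - mu) / sigma) ** 2 / df) for x in xs)
    return s + c

def llst_per_call(xs, mu, sigma, df):
    # constant recomputed for every element, as a scalar logpdf/dt() effectively does
    def logpdf_t(t):
        return t_log_norm(df) - (df + 1) / 2 * math.log1p(t * t / df)
    return sum(logpdf_t((x - mu) / sigma) - math.log(sigma) for x in xs)

vec = [i * 0.002 for i in range(1, 1001)]
print(llst_hoisted(vec, 1.0, 2.0, 3.0))   # close to -1748.25537, the value in the thread
print(llst_per_call(vec, 1.0, 2.0, 3.0))  # same answer, with 1000 redundant lgamma pairs
```

Both return the same log likelihood; the second pays for two lgamma evaluations per data point instead of two per vector.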

Your message caused me to look for the first time at the actual R source code for the implementation of the t-distribution, and I was surprised to find that it also seems to repeatedly recompute the normalization constant for each of the elements of the vector passed to dt(X, df).  (In other words, dt() is not itself vectorized.)  Perhaps this is the entire explanation for the speedup that I get.

However, it's also the case that the manner in which dt is computed in R certainly does not match up with a straightforward translation of the t-distribution from a college textbook.  In fact, my first reaction to this code was ... She did what!?! :)

I hope this clears things up a bit ...

### Douglas Bates

Sep 12, 2012, 10:23:48 AM
On Tuesday, September 11, 2012 2:58:02 PM UTC-5, Matthew Clegg wrote:
On Tuesday, September 11, 2012 11:18:30 AM UTC-4, Douglas Bates wrote:
> > Is there a reason that you want to evaluate the log-likelihood from the definition instead of using a function like dt in R?  I would have written the log-likelihood function as
> >
> >     lls1 <- function(X, mu, sigma, df) sum(dt((X-mu)/sigma, df, log=TRUE))
> >
> > but some experimentation shows that it gives a different answer than your code and also takes longer, which I find surprising.
>
> Please correct me if I am wrong, but I think you may have left out a 1/sigma factor.  If p(x) is a probability density and y = f(x), then shouldn't the probability density of x be p(f(x)) f'(x)?  In our case, f'(x) = 1/sigma.  So I believe the proper function definition should be:
>
>     lls2 <- function(X, mu, sigma, df) sum(dt((X-mu)/sigma, df, log=TRUE)) - length(X)*log(sigma)

You're correct, of course.  Rather embarrassing for a Statistics professor not to realize that.  I'm retired now so perhaps when I first learned probability densities behaved differently :-)

> After making this change, you'll find that the functions agree:
>
>     > llst(vec, 1, 2, 3)
>     [1] -1748.255
>     > lls1(vec, 1, 2, 3)
>     [1] -1055.108
>     > lls2(vec, 1, 2, 3)
>     [1] -1748.255
>
> You can check that this is the right density function (or at least reasonable) by integrating:
>
>     > dst <- function(X, mu, sigma, df) dt((X-mu)/sigma, df)/sigma
>     > integrate(dst, -20, 20, 1, 2, 3)
>     0.9978412 with absolute error < 5.7e-07
>
> > In Julia I would have written
> >
> >     function llst(X, mu::Real, sigma::Real, df::Real)
> >         dd = TDist(df)
> >         ss = 0.
> >         for x in X ss += logpdf(dd, (x-mu)/sigma) end
> >         ss
> >     end
> >
> > which, on my machine, is roughly equivalent to the R version in timing and produces the same answer, -1055.1081923585919.
> >
> > This equivalence is not surprising because the logpdf function for a T distribution uses the same code as does R for the low-level evaluation.
> >
> > It is not easy to see how the calculation is actually being done in julia/deps/Rmath/src/dt.c because that code is a twisty maze of multi-line C macros.  One thing you can notice is that it uses different code according to whether ((x-mu)/sigma)^2 > 0.2 * df.
>
> Why is my code faster?  Certainly in Julia, one part of the answer is that I am computing the normalization constant only once, whereas repeatedly calling logpdf causes it to be recomputed for each element of X, and this is the major cost of evaluating the t density.  (This is another advertisement for the benefits of memoization, although the memoization would need to occur deep in the internals of the code.)

Exactly.

> Your message caused me to look for the first time at the actual R source code for the implementation of the t-distribution, and I was surprised to find that it also seems to repeatedly recompute the normalization constant for each of the elements of the vector passed to dt(X, df).  (In other words, dt() is not itself vectorized.)  Perhaps this is the entire explanation for the speedup that I get.

Well it would certainly help.  Good work on seeing that.  My analysis was too superficial.  I like the julia-dev list because there are so many smart and knowledgeable people contributing and I get to learn a lot.

> However, it's also the case that the manner in which dt is computed in R certainly does not match up with a straightforward translation of the t-distribution from a college textbook.  In fact, my first reaction to this code was ... She did what!?! :)

As I said, a twisty maze of multi-line C macros - a style of programming that I find challenging to understand.  All this is for the purpose of squeezing the last ounce of performance out of code that, as you show, is a bad design in the first place because the bulk of the computation for each value in a vector is needlessly repeated.

We could perhaps consider additional methods for the logpdf generic from extras/distributions.jl (another file that I will convert to a module if no one gets there before me) that vectorizes more intelligently when warranted.

### Matthew Clegg

Sep 13, 2012, 9:03:57 AM

On Wednesday, September 12, 2012 10:23:48 AM UTC-4, Douglas Bates wrote:
On Tuesday, September 11, 2012 2:58:02 PM UTC-5, Matthew Clegg wrote:

> > Your message caused me to look for the first time at the actual R source code for the implementation of the t-distribution, and I was surprised to find that it also seems to repeatedly recompute the normalization constant for each of the elements of the vector passed to dt(X, df).  (In other words, dt() is not itself vectorized.)  Perhaps this is the entire explanation for the speedup that I get.
>
> Well it would certainly help.  Good work on seeing that.  My analysis was too superficial.  I like the julia-dev list because there are so many smart and knowledgeable people contributing and I get to learn a lot.

I too am impressed by what I have seen so far of the Julia community.  There seems to be a genuine willingness to help and an openness to new ideas.  Also, it seems like there might be a lot of scope for interesting user contributions.

> > However, it's also the case that the manner in which dt is computed in R certainly does not match up with a straightforward translation of the t-distribution from a college textbook.  In fact, my first reaction to this code was ... She did what!?! :)
>
> As I said, a twisty maze of multi-line C macros - a style of programming that I find challenging to understand.  All this is for the purpose of squeezing the last ounce of performance out of code that, as you show, is a bad design in the first place because the bulk of the computation for each value in a vector is needlessly repeated.
>
> We could perhaps consider additional methods for the logpdf generic from extras/distributions.jl (another file that I will convert to a module if no one gets there before me) that vectorizes more intelligently when warranted.

In terms of additional methods for distributions.jl, I guess I would vote for a log likelihood method.  I have seen many problems that eventually boil down to defining the proper likelihood function and then maximizing it.  Because computing a maximum likelihood fit involves a sum, there is a potential to eliminate the creation of an intermediate vector of (log) density values and to compute the sum in a tight inner loop.

However, I think an even better approach would be to go down to the C code and, where appropriate, explicitly vectorize the various d..., p..., q..., and r... functions, and perhaps add a family of ll... functions.  It's conceivable that the trick of factoring out the normalization constant could be used in a few other cases, and vectorizing the C code would greatly reduce the number of calls from Julia to C, which are probably more costly than calling from C to C.  Perhaps other optimizations would reveal themselves as well.

By my count, there are 21 distributions in Rmath, so a total of 84 functions would potentially need to be vectorized.  Although I would be happy to contribute, I think this is too big a project for me to take on by myself at this point, and my background is not in numerical analysis, nor in statistics.

I think I might approach the problem as follows.  I would create a new C library, say VRmath.  For each of the functions d..., p..., q..., r... in Rmath, there would be a vectorized version, vd..., vp..., vq..., vr....  There would be a standardized calling interface:

```
v{d,p,q}...(ResultVec, VectorLen, ArgumentVec, Param1, Param2, ...)
vr...(ResultVec, VectorLen, Param1, Param2, ...)
```

The initial implementation for each would simply be a macro that repeatedly calls the corresponding routine from Rmath.  This would quickly give a functioning library, which might immediately result in a performance improvement, due to the reduced number of calls between Julia and C.  Over time, individual routines from Rmath could be replaced with customized vectorized versions.  In a best case scenario, many functions pertaining to the core statistical distributions could end up running significantly faster in Julia than in R itself.

If you like the idea and you want to move forward on it, send me an email and we can discuss it further.

### Stefan Karpinski

Sep 13, 2012, 10:02:02 AM
On Thu, Sep 13, 2012 at 9:03 AM, Matthew Clegg wrote:

> I too am impressed by what I have seen so far of the Julia community.  There seems to be a genuine willingness to help and an openness to new ideas.

It's a really great community :-)

> Also, it seems like there might be a lot of scope for interesting user contributions.

The biggest "surprise" benefit of addressing the two language problem — i.e. the standard performance/usability compromise of using a slow high-level language for the logic and a fast low-level language for the heavy lifting — is that having both high-level and low-level code in one language reduces the social barrier to contributing to the internals of the language. All users become potential contributors, which ends up accelerating system development by a really unexpected amount.

> However, I think an even better approach would be to go down to the C code and, where appropriate, explicitly vectorize the various d..., p..., q..., and r... functions, and perhaps add a family of ll... functions.  It's conceivable that the trick of factoring out the normalization constant could be used in a few other cases, and vectorizing the C code would greatly reduce the number of calls from Julia to C, which are probably more costly than calling from C to C.  Perhaps other optimizations would reveal themselves as well.

There is actually no overhead to calling a C function from Julia: it's identical to calling a C function from C, except, of course, that the compiler doesn't have the opportunity to inline the function body. The inlining issue is one of the reasons it's good to have as much functionality as possible implemented in Julia: that way it can be inlined for even better performance (it's not just the elimination of function-call overhead that helps; inlining also often exposes additional optimization opportunities like dead code elimination).