Benchmarking study: C++ < Fortran < Numba < Julia < Java < Matlab < the rest


Florian Oswald

Jun 16, 2014, 11:52:07 AM
to julia...@googlegroups.com
Dear all,

I thought you might find this paper interesting: http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf

It takes a standard model from macroeconomics and computes its solution with an identical algorithm in several languages. Julia is roughly 2.6 times slower than the best C++ executable. I was a bit puzzled by the result, since in the benchmarks on http://julialang.org/ the slowest test is 1.66 times slower than C. I realize that those benchmarks can't cover all possible situations. That said, I couldn't really find anything unusual in the Julia code; I did some profiling and experimented with type declarations, but that's as fast as I got it. That's not to say that I'm disappointed, I still think this is great. Did I miss something obvious here, or is there something specific to this algorithm?

The code is on GitHub at



John Myles White

Jun 16, 2014, 11:55:50 AM
to julia...@googlegroups.com
Maybe it would be good to verify the claim made at https://github.com/jesusfv/Comparison-Programming-Languages-Economics/blob/master/RBC_Julia.jl#L9

I would think that specifying all those types wouldn’t matter much if the code doesn’t have type-stability problems.

— John

Dahua Lin

Jun 16, 2014, 12:13:44 PM
to julia...@googlegroups.com
First, I agree with John that you don't have to declare the types in general, as you would in a compiled language. It seems that Julia should be able to infer the types of most variables in your code.

There are several ways that your code's efficiency may be improved:

(1) You can use @inbounds to skip bounds checking in several places, such as lines 94 and 95 (in RBC_Julia.jl).
(2) Lines 114 and 116 involve allocating new arrays, which is probably unnecessary. Also note that Base.maxabs can compute the maximum absolute value more efficiently than maximum(abs(...)). (See the sketch below.)
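
For illustration, here is a minimal sketch of both points (with hypothetical array names, not the actual variables in RBC_Julia.jl):

function maxabsdiff(x::Matrix{Float64}, y::Matrix{Float64})
    # @inbounds skips the bounds check on each indexing operation; the manual
    # loop also avoids the temporary array that maximum(abs(x - y)) allocates
    m = 0.0
    @inbounds for i in 1:length(x)
        d = abs(x[i] - y[i])
        if d > m
            m = d
        end
    end
    return m
end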

In terms of measurement, did you pre-compile the function before measuring the runtime?
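
That is, something along these lines (main being a stand-in for the benchmark's entry function):

main()        # first call triggers JIT compilation
@time main()  # second call measures the steady-state runtime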

A side note about code style: the code uses a lot of Java-ish descriptive camelCase names. Julia practice tends to encourage more concise naming.

Dahua

Florian Oswald

Jun 16, 2014, 12:21:36 PM
to julia...@googlegroups.com
Hi guys,

Thanks for the comments. Note that I'm not the author of this code [so the variable names are not on me :-)], I just tried to speed it up a bit. In fact, declaring types before running the computation function and using @inbounds made the code 24% faster than the benchmark version. Here's my attempt:


I should try Base.maxabs.

In profiling this I found that a lot of time is spent here:


which I'm not sure how to avoid.

Stefan Karpinski

Jun 16, 2014, 12:48:53 PM
to Julia Users
That's an interesting comparison. Being on par with Java is quite respectable. There's nothing really obvious to change in that code, and it definitely doesn't need so many type annotations. If the annotations do improve the performance, it's possible that there's a type instability somewhere without them; the annotation would avoid the instability by forcing a conversion, but the conversion itself can be expensive.

Andreas Noack Jensen

Jun 16, 2014, 1:02:55 PM
to julia...@googlegroups.com
I think that the log in openlibm is slower than most system logs. On my mac, if I use

mylog(x::Float64) = ccall((:log, "libm"), Float64, (Float64,), x)

the code runs 25% faster. If I also use @inbounds and devectorize the maximum(abs(...)), it runs in 2.26 seconds on my machine. The C++ version compiled with Xcode and -O3 runs in 1.9 seconds.
--
Med venlig hilsen

Andreas Noack Jensen

Stefan Karpinski

Jun 16, 2014, 1:08:40 PM
to Julia Users
Doing the math, that makes that optimized Julia version 18% slower than C++, which is fast indeed.

Florian Oswald

Jun 16, 2014, 1:17:12 PM
to julia...@googlegroups.com
Interesting! I just tried that: I defined mylog inside the computeTuned function


but that actually slowed things down considerably. I'm on a Mac as well, but it seems that's not enough to reproduce this? Or where did you define this function?

Stefan Karpinski

Jun 16, 2014, 1:21:02 PM
to Julia Users
Different systems have quite different libm implementations, both in terms of speed and accuracy, which is why we have our own. It would be nice if we could get our log to be faster.

Tim Holy

Jun 16, 2014, 1:30:48 PM
to julia...@googlegroups.com
From the sound of it, one possibility is that you made it a "private function"
inside the computeTuned function. That creates the equivalent of an anonymous
function, which is slow. You need to make it a generic function (define it
outside computeTuned).
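
A minimal sketch of the two arrangements (hypothetical names; this reflects Julia 0.2/0.3, where inner functions were compiled like anonymous functions):

# Slow on 0.2/0.3: the inner definition is treated like an anonymous function
function computeSlow(x::Vector{Float64})
    mylog(v::Float64) = ccall((:log, "libm"), Float64, (Float64,), v)
    s = 0.0
    for v in x
        s += mylog(v)
    end
    return s
end

# Fast: a top-level generic function, which the compiler can inline into the loop
mylog(v::Float64) = ccall((:log, "libm"), Float64, (Float64,), v)

function computeFast(x::Vector{Float64})
    s = 0.0
    for v in x
        s += mylog(v)
    end
    return s
end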

--Tim

Stefan Karpinski

Jun 16, 2014, 1:32:50 PM
to Julia Users

Jesus Villaverde

Jun 16, 2014, 4:54:13 PM
to julia...@googlegroups.com
Hi

I am one of the authors of the paper :)

Our first version of the code did not declare types. It was thanks to Florian's suggestion that we started doing it. We discovered, to our surprise, that it reduced execution time by around 25%. I may be mistaken, but I do not think there are type-stability problems. We have a nearly identical version of the code in C++ and we did not have any such type problems.

Jesus Villaverde

Jun 16, 2014, 4:56:34 PM
to julia...@googlegroups.com
Hi

1) Yes, we pre-compiled the function.

2) As I mentioned before, we tried the code with and without type declarations; it makes a difference.

3) The variable names turn out to be quite useful because this code will eventually be nested into a much larger project where it is convenient to have very explicit names.

Thanks 

Jesus Villaverde

Jun 16, 2014, 5:59:31 PM
to julia...@googlegroups.com
Also, defining

mylog(x::Float64) = ccall((:log, "libm"), Float64, (Float64,), x)

made quite a bit of difference for me: from 1.92 to around 1.55 seconds. If I also add @inbounds, I go down to 1.45, making Julia only about twice as slow as C++. Numba still beats Julia, which kind of bothers me a bit.

Thanks for the suggestions.

Peter Simon

Jun 17, 2014, 12:03:29 AM
to julia...@googlegroups.com
By a process of elimination, I determined that the only variable whose declaration affected the run time was vGridCapital.  The variable is declared to be of type Array{Float64,1}, but is initialized as


vGridCapital = 0.5*capitalSteadyState:0.00001:1.5*capitalSteadyState

which, unlike in Matlab, produces a Range object, rather than an array.  If the line above is modified to

vGridCapital = [0.5*capitalSteadyState:0.00001:1.5*capitalSteadyState]

then the type instability is eliminated, and all type declarations can be removed with no effect on execution time.
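
A quick way to see the difference at the REPL (with a made-up steady-state value):

capitalSteadyState = 3.0

r = 0.5*capitalSteadyState:0.00001:1.5*capitalSteadyState
typeof(r)   # a Range type, not Array{Float64,1}

v = [0.5*capitalSteadyState:0.00001:1.5*capitalSteadyState]
typeof(v)   # Array{Float64,1} on Julia 0.2/0.3 (later versions need collect(r) instead)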

--Peter

Stefan Karpinski

Jun 17, 2014, 12:20:43 AM
to julia...@googlegroups.com
Ah! Excellent sleuthing. That's about the kind of thing I suspected was going on.

Florian Oswald

Jun 17, 2014, 4:20:05 AM
to julia...@googlegroups.com
Hi Dahua,
I cannot find Base.maxabs (i.e., Julia says Base.maxabs is not defined).

I'm here:

julia> versioninfo()
Julia Version 0.3.0-prerelease+2703
Commit 942ae42* (2014-04-22 18:57 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin12.5.0)
  CPU: Intel(R) Core(TM) i5-2435M CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm

cheers

Tomas Lycken

Jun 17, 2014, 4:30:00 AM
to julia...@googlegroups.com
It seems Base.maxabs was added (by Dahua) as recently as May 30 - https://github.com/JuliaLang/julia/commit/78bbf10c125a124bc8a1a25e8aaaea1cbc6e0ebc

If you update your Julia to the latest master, you'll have it =)

// T

Florian Oswald

Jun 17, 2014, 4:35:18 AM
to julia...@googlegroups.com
Hi Tim - true!
(Why on earth would I do that?)

Defining it outside reproduces the speed gain. Thanks!

Milan Bouchet-Valat

Jun 17, 2014, 4:53:41 AM
to julia...@googlegroups.com
Le lundi 16 juin 2014 à 14:59 -0700, Jesus Villaverde a écrit :
> Also, defining
>
> mylog(x::Float64) = ccall((:log, "libm"), Float64, (Float64,), x)
>
> made quite a bit of difference for me, from 1.92 to around 1.55. If I
> also add @inbounds, I go down to 1.45, making Julia only twice as
> slow as C++. Numba still beats Julia, which kind of bothers me a bit
Since Numba uses LLVM too, you should be able to compare the LLVM IR it
generates to that generated by Julia. Doing this at least for the tight
loop would be very interesting.
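
On the Julia side that would be something like the following, where innerloop stands in for the hot function in RBC_Julia.jl:

innerloop(a::Float64, b::Float64) = a*b + log(a)   # hypothetical hot function
code_llvm(innerloop, (Float64, Float64))           # prints the generated LLVM IR

Numba has analogous facilities for dumping the IR of a jitted function, so the two outputs could be compared side by side.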


My two cents

Jesus Villaverde

Jun 17, 2014, 8:35:44 AM
to julia...@googlegroups.com
Ahhhhh!!!!!!!! Sorry, over 20 years of coding in Matlab :(

Yes, you are right: once I change that line, the type declarations are irrelevant. We should update the paper and the code ASAP.

Bruno Rodrigues

Jun 17, 2014, 8:50:02 AM
to julia...@googlegroups.com
Hi Prof. Villaverde, just wanted to say that it was your paper that made me try Julia. I must say that I am very happy with the switch! Will you continue using Julia for your research?

Stefan Karpinski

Jun 17, 2014, 9:11:08 AM
to julia...@googlegroups.com
Not your fault at all. We need to make this kind of thing easier to discover, e.g. with

Jesus Villaverde

Jun 17, 2014, 9:23:57 AM
to julia...@googlegroups.com
I think so! Matlab is just too slow for many things and a bit old in some dimensions. I often use C++, but for a lot of stuff it is just too cumbersome.

Jesus Villaverde

Jun 17, 2014, 9:25:24 AM
to julia...@googlegroups.com
Thanks! I'll learn those tools. In any case, the paper has been updated online and the GitHub page has a new commit. This is really great. A nice example of aggregation of information. Economists love that :)

Tony Kelman

Jun 17, 2014, 10:11:40 AM
to julia...@googlegroups.com
Your matrices are kinda small so it might not make much difference, but it would be interesting to see whether using the Tridiagonal type could speed things up at all.
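
For reference, a sketch of what using it looks like (made-up sizes, not tied to the model; Tridiagonal is in Base on 0.2/0.3, in LinearAlgebra on later versions):

n = 1000
T = Tridiagonal(rand(n-1), rand(n), rand(n-1))  # stores only the three diagonals
b = rand(n)
x = T \ b                                       # uses a specialized O(n) solver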

Cameron McBride

Jun 17, 2014, 10:19:23 AM
to julia-users
Do any of the more initiated have an idea why Numba performs better for this application, as both it and Julia use LLVM?  I'm just asking out of pure curiosity. 

Cameron

Cristóvão Duarte Sousa

Jun 17, 2014, 12:21:54 PM
to julia...@googlegroups.com
I've just done measurements of the algorithm's inner loop times on my machine by changing the code as shown in this commit.

I've found out something... see for yourself:

using Winston

numba_times = readdlm("numba_times.dat")[10:end];
plot(numba_times)

julia_times = readdlm("julia_times.dat")[10:end];
plot(julia_times)

println((median(numba_times), mean(numba_times), var(numba_times)))
(0.0028225183486938477,0.0028575707378805993,2.4830103817464292e-8)

println((median(julia_times), mean(julia_times), var(julia_times)))
(0.0028240440000000004,0.0034863882123824454,1.7058255003790299e-6)

So, while the inner loop times have more or less the same median in both the Julia and Numba tests, the mean and variance are higher in Julia.

Could that be due to the garbage collector kicking in?
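
One way to test that hypothesis (a sketch; gc_disable/gc_enable as spelled on Julia 0.3, GC.enable(false)/GC.enable(true) on later versions, and main standing in for the benchmark function):

gc()                 # collect once before timing, to start clean
gc_disable()         # forbid collections inside the timed region
t = @elapsed main()
gc_enable()

If the variance disappears with the collector disabled, GC pauses are the culprit.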

Stefan Karpinski

Jun 17, 2014, 12:31:51 PM
to Julia Users
That definitely smells like a GC issue. Python doesn't have this particular problem since it uses reference counting.

John Myles White

Jun 17, 2014, 12:32:34 PM
to julia...@googlegroups.com
Sounds like we need to rerun these benchmarks after the new GC branch gets updated.

 -- John

Cristóvão Duarte Sousa

Jun 17, 2014, 12:55:08 PM
to julia...@googlegroups.com
Are you talking about the incremental GC?

It happens that I also have that branch compiled, since I'm running some experiments with a (pseudo-)realtime simulation in Julia.
In my realtime experiment, with a Timer firing with a period of 2.2 ms, I see a big delay of about 9 ms roughly every second when using master Julia.
By using the incremental GC those delays disappear.

However, in the time measurements I described before, the incremental GC doesn't seem to produce any better results...

Peter Simon

Jun 17, 2014, 12:57:00 PM
to julia...@googlegroups.com
As pointed out by Dahua, there is a lot of unnecessary memory allocation.  This can be reduced significantly by replacing the lines

        maxDifference = maximum(abs(mValueFunctionNew - mValueFunction))
        mValueFunction = mValueFunctionNew
        mValueFunctionNew = zeros(nGridCapital, nGridProductivity)

with

        maxDifference = maximum(abs!(subtract!(mValueFunction, mValueFunctionNew)))
        (mValueFunction, mValueFunctionNew) = (mValueFunctionNew, mValueFunction)
        fill!(mValueFunctionNew, 0.0)



abs! and subtract! require adding the line

using NumericExtensions

prior to the function line. I think the OP used Julia 0.2; I don't believe that NumericExtensions will work with that old version. When I combine these changes with adding an

@inbounds begin
...
end

block around the "while" loop, I get about a 25% reduction in execution time, and a reduction of memory allocation from roughly 700 MB to 180 MB.

--Peter

Andreas Noack Jensen

Jun 17, 2014, 1:05:03 PM
to julia...@googlegroups.com
...but the Numba version doesn't use tricks like that. 

The uniform metric can also be calculated with a small loop. I think that requiring extra dependencies is against the purpose of the exercise.

Peter Simon

Jun 17, 2014, 1:25:47 PM
to julia...@googlegroups.com
You're right.  Replacing the NumericExtensions function calls with a small loop

        maxDifference = 0.0
        for k = 1:length(mValueFunction)
            maxDifference = max(maxDifference, abs(mValueFunction[k] - mValueFunctionNew[k]))
        end


makes no significant difference in execution time or memory allocation and eliminates the dependency.

--Peter

Florian Oswald

Jun 17, 2014, 1:50:04 PM
to julia...@googlegroups.com
Thanks Peter. I made that devectorizing change after Dahua suggested it. It made a massive difference!

David Anthoff

Jun 17, 2014, 1:57:47 PM
to julia...@googlegroups.com

I submitted three pull requests to the original repo that get rid of three different array allocations in loops and that make things a fair bit faster altogether:

https://github.com/jesusfv/Comparison-Programming-Languages-Economics/pulls

I think it would also make sense to run these benchmarks on Julia 0.3.0 instead of 0.2.1, given that there have been a fair number of performance improvements.

Peter Simon

Jun 17, 2014, 3:08:18 PM
to julia...@googlegroups.com
Sorry, Florian and David, for not seeing that you were way ahead of me.

On the subject of the log function: I tried implementing mylog() as defined by Andreas on Julia running on CentOS, and the result was a significant slowdown! (Yes, I defined the mylog function outside of main, at the module level.) Not sure if this is due to variation in the quality of the libm function on various systems or what. If so, then it makes sense that Julia wants a uniformly accurate and fast implementation via openlibm. But for the fastest transcendental function performance, I assume that one must use the micro-coded versions built into the processor's FPU. Is that what the fast libm implementations do? In that case, how could one hope to compete when using a C-coded version?

--Peter

David Anthoff

Jun 17, 2014, 3:25:50 PM
to julia...@googlegroups.com

Another interesting result from the paper is how much faster code generated by Visual C++ 2010 is than gcc's, on Windows. For their example, the gcc runtime is 2.29 times the runtime of the MS-compiled version. The difference might be even larger with Visual C++ 2013, because that is when MS added an auto-vectorizer that is on by default.

 

I vaguely remember a discussion about compiling Julia itself with the MS compiler on Windows; is that working, and does it make a performance difference?

Tobias Knopp

Jun 17, 2014, 3:41:36 PM
to julia...@googlegroups.com
There are some remaining issues, but compilation with MSVC is almost possible. I did some initial work, and Tony Kelman made lots of progress in https://github.com/JuliaLang/julia/pull/6230. But there have not been any speed comparisons as far as I know. Note that Julia uses JIT compilation, so I would not expect the source compiler to have a huge impact.

Tony Kelman

Jun 17, 2014, 3:44:33 PM
to julia...@googlegroups.com

A couple of tiny changes aren't in master at the moment, but I was able to get libjulia compiled and julia.exe starting the system image bootstrap. It hit a stack overflow at osutils.jl, which is right after inference.jl, so the problem is likely in compiling type inference. Apparently I was missing some flags that are used in the MinGW build to increase the default stack size. I haven't gotten back to giving it another try recently.

David Anthoff

Jun 17, 2014, 3:47:49 PM
to julia...@googlegroups.com

I was thinking more that this might make a difference for some of the dependencies, like OpenBLAS? But I'm not even sure that can be compiled at all using MS compilers…

Tony Kelman

Jun 17, 2014, 4:03:51 PM
to julia...@googlegroups.com
We're diverging from the topic of the thread, but anyway...

No, MSVC OpenBLAS will probably never happen, you'd have to CMake-ify the whole thing and probably translate all of the assembly to Intel syntax. And skip the Fortran, or use Intel's compiler. I don't think they have the resources to do that.

There's a C99-only optimized BLAS implementation under development by the FLAME group at University of Texas here https://github.com/flame/blis that does aim to eventually support MSVC. It's nowhere near as mature as OpenBLAS in terms of automatically detecting architecture, cache sizes, etc. But their papers look very promising. They could use more people poking at it and submitting patches to get it to the usability level we'd need.

The rest of the dependencies vary significantly in how painful they would be to build with MSVC. GMP in particular was forked into a new project called MPIR, with MSVC compatibility being one of the major reasons.

Tobias Knopp

Jun 17, 2014, 4:24:21 PM
to julia...@googlegroups.com
I think one has to distinguish between the Julia core dependencies and the runtime dependencies. The latter (like OpenBLAS) don't tell us much about how fast "Julia" is. The libm issue discussed in this thread is of such a nature.

Jesus Villaverde

Jun 17, 2014, 6:03:35 PM
to julia...@googlegroups.com
I ran the code on 0.3.0. It did not improve things (in fact, there was a 3-5% deterioration).

Alireza Nejati

Jun 17, 2014, 6:23:24 PM
to julia...@googlegroups.com
> But for fastest transcendental function performance, I assume that one must use the micro-coded versions built into the processor's FPU--Is that what the fast libm implementations do?

Not at all. Libm's version of log() is about twice as fast as the CPU's own log instruction, at least on a modern x86_64 processor (really fast log implementations use optimized look-up tables). I had a look at your code, and it seems that the 'consumption' variable is always in the very narrow range of 0.44950 to 0.56872. If you plot the log function on this tiny range, it is very flat and nearly linear. I think that if you simply replaced it with a 2- or 4-part piecewise approximation, you could get a significant speedup across the board, in Julia, C++, and the others, with only a very small approximation error.
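
As a sketch of that idea (hypothetical code: the endpoints are the range quoted above, and the approximation error should be checked before trusting it in the model):

const XLO   = 0.44950
const XHI   = 0.56872
const NSEG  = 4
const STEP  = (XHI - XLO) / NSEG
const KNOTS = [XLO + i*STEP for i in 0:NSEG]   # segment endpoints
const LOGS  = map(log, KNOTS)                  # precomputed log at each knot

function fastlog(x::Float64)
    i = min(NSEG, 1 + floor(Int, (x - XLO) / STEP))   # segment containing x
    # linear interpolation between the two surrounding knots
    return LOGS[i] + (LOGS[i+1] - LOGS[i]) * (x - KNOTS[i]) / STEP
end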

Dahua Lin

Jun 17, 2014, 7:39:05 PM
to julia...@googlegroups.com
Perhaps we should first profile the code and see which part constitutes the bottleneck?

Dahua

Alireza Nejati

Jun 17, 2014, 8:23:37 PM
to julia...@googlegroups.com
Dahua: On my setup, most of the time is spent in the log function.

Dahua Lin

Jun 18, 2014, 7:22:28 AM
to julia...@googlegroups.com
Good to know. It looks like we should consider improving the performance of openlibm, especially for commonly used functions such as exp, log, etc.

Cristóvão Duarte Sousa

Jun 19, 2014, 6:09:31 AM
to julia...@googlegroups.com
The reason I did readdlm("numba_times.dat")[10:end], discarding the first timings, is that at least the very first one is much, much higher than the others. It seems that Numba also performs better after a warm-up.

Therefore, to be fair to Numba, we must discard the first function call time, as we often do with Julia.

Using this commit which runs the main function twice for the Numba test, I get times around this:

1st run elapse time = 4.60552310944
2nd run elapse time = 4.24891805649

which I think is significant.

Andrew

Jan 30, 2016, 3:12:44 PM
to julia-users
I just ran several of these benchmarks using the code and compilation flags available at https://github.com/jesusfv/Comparison-Programming-Languages-Economics . On my computer Julia is faster than C, C++, and Fortran, which I find surprising, unless some really dramatic optimization happened since 0.2.

My results are, on a Linux machine:

Julia 0.4.2: 1.44s
Julia 0.3.13: 1.60s
C, gcc 4.8.4: 1.65s
C++, g++: 1.64s
Fortran, gfortran 4.8.4: 1.65s
Matlab R2015b : 5.65s
Matlab R2015b, Mex inside loop: 1.83s
Python 2.7: 50.9s 
Python 2.7 Numba: 1.88s with warmup

It's possible there's something bad about my configuration, as I don't normally use C and Fortran. In the paper their C/Fortran code runs in 0.7s; I don't think their computer is twice as fast as mine, but maybe it is.

Scott Jones

Jan 30, 2016, 6:46:29 PM
to julia-users


On Saturday, January 30, 2016 at 3:12:44 PM UTC-5, Andrew wrote:
I just ran several of these benchmarks using the code and compilation flags available at https://github.com/jesusfv/Comparison-Programming-Languages-Economics . On my computer Julia is faster than C, C++, and Fortran, which I find surprising, unless some really dramatic optimization happened since 0.2.

In the 9 months since I learned about Julia, I have seen major improvements in different areas of performance (starting with v0.3.4; I'm now using v0.4.3 for work and v0.5 master for fun).
Just look at the speedup from the *very* recent merge of https://github.com/JuliaLang/julia/pull/13412 - wonderful stuff. Warts and rough edges are getting removed too; it just keeps getting better.