Julia vs Matlab: interpolation and looping


pokerho...@googlemail.com

Jan 29, 2016, 8:02:53 PM
to julia-users
Hi,

my original problem is a dynamic programming problem in which I need to interpolate the value function on an irregular grid using a cubic spline method. I was translating my MATLAB code into Julia and used the Dierckx package in Julia to do the interpolation (there weren't many alternatives that did splines on an irregular grid, as far as I recall). In MATLAB I use interp1. It gave exactly the same result but was about 50 times slower, which puzzled me. So I made this stackoverflow post (http://stackoverflow.com/questions/34766029/julia-vs-matlab-why-is-my-julia-code-so-slow).

The post is messy and you don't need to read through it, I think. The bottom line was that the Dierckx package apparently calls some Fortran code that seems to be pretty old (and slow, and doesn't use multiple cores). Nobody knows what exactly interp1 is doing; my guess is that it's coded in C?!

So I asked a friend of mine who knows a bit of C, and he was kind enough to help me out. He translated the interpolation method into C and made it use multiple threads (I am working with 12 threads). He also put it on github here (https://github.com/nuffe/mnspline). Equipped with that, I went back to my original problem and reran it. The Julia code was still 3 times slower, which left me puzzled again. The interpolation itself was now much faster than MATLAB's interp1, but somewhere along the way that advantage was lost. Consider the following minimal working example, which preserves the irregular grid of the original problem and highlights the point, I think (the only action is in the loop; the other stuff just generates the irregular grid):

MATLAB:

spacing = 1.5;
Nxx = 300;
Naa = 350;
Nalal = 200;
sigma = 10;
NoIter = 10000;

xx = NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end

f_util = @(c) c.^(1-sigma)/(1-sigma);
W = NaN(Nxx,1);
W(:,1) = f_util(xx);

W_temp = NaN(Nalal,Naa);
xprime = NaN(Nalal,Naa);

tic
for banana=1:NoIter
%   tic
    xprime = ones(Nalal,Naa);
    W_temp = interp1(xx,W(:,1),xprime,'spline');
%   toc
end
toc


Julia:

include("mnspline.jl")

spacing = 1.5
Nxx = 300
Naa = 350
Nalal = 200
sigma = 10
NoIter = 10000

xx = Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_util(c) = c.^(1-sigma)/(1-sigma)
W = Array(Float64,Nxx,1)
W[:,1] = f_util(xx)

spl = mnspline(xx,W[:,1])

function performance(NoIter::Int64)
    W_temp = Array(Float64,Nalal*Naa)
    W_temp2 = Array(Float64,Nalal,Naa)
    xprime = Array(Float64,Nalal,Naa)
    for banana=1:NoIter
        xprime = ones(Nalal,Naa)
        W_temp = spl(xprime[:])
    end
    W_temp2 = reshape(W_temp,Nalal,Naa)
end

@time performance(NoIter)


In the end I want to have a matrix; that's why I do all this reshaping in the Julia code. If I comment out the loop and just consider one iteration, Julia does it in (I ran it several times, precompiling etc.):
0.000565 seconds (13 allocations: 1.603 MB)

MATLAB on the other hand: Elapsed time is 0.007864 seconds.

The bottom line being that the Julia code is much faster (more than 10 times in this case), which it should be, since I use all my threads and the method is written in C. However, if I don't comment out the loop and run the code as posted above:

Julia:

3.205262 seconds (118.99 k allocations: 15.651 GB, 14.08% gc time)

MATLAB:
Elapsed time is 4.514449 seconds.


If I run the whole loop, MATLAB apparently catches up a lot. It is still slower in this case. In the original problem, though, Julia was only about 3 times faster within the loop, and once I looped through, MATLAB turned out to be 3 times faster. I am stuck here; does anybody have an idea what might be going on, or whether I am doing something wrong in my Julia code? A hint what I could do better?
Right now I am trying to parallelize the loop as well. But that's obviously unfair compared with MATLAB, because you could parallelize that too (if you buy that expensive toolbox).








Andrew

Jan 29, 2016, 9:32:32 PM
to julia-users
Your loop has a ton of unnecessary allocation. You can move xprime=ones(Nalal,Naa) outside the loop.
Also, you are converting xprime to a vector at every iteration; you can do this outside the loop too.

After the changes, I get

julia> include("test2.jl");
WARNING: redefining constant lib
  3.726049 seconds (99.47 k allocations: 15.651 GB, 6.56% gc time)

julia> include("test3.jl");
WARNING: redefining constant lib
  2.352259 seconds (29.40 k allocations: 5.219 GB, 3.29% gc time)

in test3 the performance function is 

function performance(NoIter::Int64)
    W_temp = Array(Float64,Nalal*Naa)
    W_temp2 = Array(Float64,Nalal,Naa)
    xprime=Array(Float64,Nalal,Naa)
    xprime=ones(Nalal,Naa)
    xprime = xprime[:]
    for banana=1:NoIter
        W_temp = spl(xprime)
    end
    W_temp2 = reshape(W_temp,Nalal,Naa)
end

I don't know if you have this in your original code though. Maybe it's just your example.

I also need fast cubic splines because I do dynamic programming, though I don't think Dierckx is a bottleneck for me. I think the Interpolations package might soon have cubic splines on a non-uniform grid, and it's very fast.

Tomas Lycken

Jan 30, 2016, 4:21:45 AM
to julia-users
I'd love to add non-uniform interpolation schemes of higher degree than linear to Interpolations.jl. The two main reasons it hasn't been done already are time (focus has been on reaching feature parity with Grid.jl, which the package started out as a replacement for) and knowledge (I don't know of good algorithms for it). The second one is easily solved if you can link to a few good papers on the subject and/or liberally licensed implementations; the first one is best fixed with a PR :)

// T

pokerho...@googlemail.com

Jan 30, 2016, 8:59:27 AM
to julia-users
@Tomas: maybe check out Numerical Recipes in C: The Art of Scientific Computing, 2nd edition. There is also an edition for Fortran. The code that I use in C is basically from there.

@Andrew: The xprime needs to be in the loop. I just made it ones to simplify, but normally it changes every iteration. (In the DP problem, the loop is calculating an expectation, and xprime is the possible future value of the state variable for each state of the world.) Concerning the Dierckx package: I don't know about the general behaviour, but for my particular problem (irregular grid + cubic spline) it is very slow. Run the following code:

using Dierckx

spacing = 1.5
Nxx = 300
Naa = 350
Nalal = 200
sigma = 10
NoIter = 10000

xx = Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_util(c) = c.^(1-sigma)/(1-sigma)
W = Array(Float64,Nxx,1)
W[:,1] = f_util(xx)

spl = Spline1D(xx,W[:,1])

function performance2(NoIter::Int64)
    W_temp = Array(Float64,Nalal*Naa)
    W_temp2 = Array(Float64,Nalal,Naa)
    xprime = Array(Float64,Nalal,Naa)
    for banana=1:NoIter
        xprime = ones(Nalal,Naa)
        W_temp = spl(xprime[:])
    end
    W_temp2 = reshape(W_temp,Nalal,Naa)
end

@time performance2(10000)
30.878093 seconds (100.01 k allocations: 15.651 GB, 2.19% gc time)


That's why I went on and asked my friend to help me out in the first place. I think mnspline is really fast (not saying it's THE fastest) at doing the interpolation itself (magnitudes faster than MATLAB). But then I just don't understand how MATLAB can catch up by just looping through the same operation over and over. Intuitively (maybe I'm wrong) it should be somewhat proportional: if my code in Julia is 10 times faster within a loop, and I just repeat the operation in that loop very often, how can it turn out to be only equally fast as MATLAB? Again, mnspline uses all my threads; maybe it has something to do with overhead. I don't know, hints appreciated.

Lutfullah Tomak

Jan 30, 2016, 9:42:48 AM
to julia-users
If you do not change the length of xprime or use it later for other purposes, then just update the existing array instead of re-allocating it each time. Also, using global variables in the innermost loop is very inefficient in Julia.
It would be good to revise the code in light of the tips from the docs: http://docs.julialang.org/en/release-0.4/manual/performance-tips/
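To make the globals point concrete, here is a toy sketch (hypothetical names, not the thread's code): at global scope the compiler cannot assume stable types, so every iteration pays for dynamic dispatch, while the same loop inside a function gets specialized.

```julia
# At global scope, something like
#
#     total = 0.0
#     for i in 1:n
#         total += i
#     end
#
# is slow, because `total` and `n` are globals whose types could change
# at any time. Wrapping the identical loop in a function lets the
# compiler infer the types once and emit fast specialized code:
function sum_to(n::Int)
    total = 0.0
    for i in 1:n
        total += i
    end
    return total
end
```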

Isaiah Norton

Jan 30, 2016, 10:02:40 AM
to julia...@googlegroups.com
Numerical Recipes in C: The Art of Scientific Computing, 2nd edition. 

Please note that code derived from this book cannot be included in BSD, LGPL, or GPL licensed libraries (which is to say, most Julia packages). Distribution is restricted to compiled binaries only, with commercial and non-commercial license variants. Source redistribution is prohibited.

Numerical Recipes should not be relied on for any implementations that are submitted to Julia core or the open source Julia package ecosystem.

Tim Holy

Jan 30, 2016, 10:57:19 AM
to julia...@googlegroups.com
My understanding is that it's fine to read the text & mathematical description,
but you shouldn't look at the code.

Best,
--Tim

Andrew

Jan 30, 2016, 11:47:51 AM
to julia-users
When I say Dierckx isn't a bottleneck for me, I mean my own code spends most of its time doing things other than interpolation, like solving non-linear equations and other calculations. All your loop does is interpolate, so there it must be the bottleneck. 

For the expectation, you can reuse the same vector. You could also devectorize it and compute the expectation point by point, though I don't know if this is any faster.
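A minimal sketch of the point-by-point version (hypothetical names, made-up data; this assumes the expectation is a probability-weighted sum over future states, as in the DP problem described above):

```julia
# Devectorized expectation: accumulate sum_j p[j] * fx[j] point by
# point instead of building intermediate arrays. `p` and `fx` are
# hypothetical stand-ins for the state probabilities and the
# interpolated values at each future state.
function expectation(p::Vector{Float64}, fx::Vector{Float64})
    s = 0.0
    @inbounds for j in eachindex(p)
        s += p[j] * fx[j]
    end
    return s
end
```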

Maybe you have problems with unnecessary allocation and global variables in the original code.

pokerho...@googlemail.com

Jan 30, 2016, 1:13:19 PM
to julia-users
Ok. My original code certainly spends most of its time looping over the interpolation. In that sense, the example I posted here is similar to the original code and highlights the problem I am facing, I think. Fwiw, I wrap the original code in a function. I also do it above, at least for the performance-relevant part. If I do it for the whole code above -- no difference.

The first time I translated the code from MATLAB to Julia, I actually computed the expectation point by point with two nested loops (or with one loop and then reshaping) -- it was slower than vectorizing (see also the stackoverflow post).


@Lutfullah Tomak: I don't really get that point; maybe I am misunderstanding something, but the innermost part of the loop is wrapped in a function, so there should be no global variables?!

Lutfullah Tomak

Jan 30, 2016, 3:03:17 PM
to julia-users
I did not pay attention to the stackoverflow post. There, all the code is wrapped in a function.
However, I was talking about the examples here, as in


for banana=1:NoIter
    xprime = ones(Nalal,Naa)
    W_temp = spl(xprime[:])
end

If all the code runs as shown in the example here, then Nalal and Naa are going to be checked for their types, and `ones` will be dispatched accordingly on each iteration of the loop.
Having looked at the stackoverflow post, there are some differences.
I cannot try the code right now. I wish I could offer something from a practical perspective using the actual code.

Regards

Lutfullah Tomak

Feb 2, 2016, 10:50:54 AM
to julia-users
I tried these examples. It only improves if xprime is not allocated over and over. Instead of ones(...), try fill!(xprime,1.0). Also, colon indexing xprime[:] allocates memory; you can use reinterpret/reshape instead. In real code, xprime should not be re-allocated with the ones method.
You can update it elementwise in for loops with the @inbounds macro. At least, updating xprime in place and not using xprime[:] solves the wasted-allocation issue (doing both of these in my trials, overall allocated memory was around 1/3 of the original example).
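To make the fill!/colon-indexing point concrete, a toy sketch (not the thread's arrays): fill! overwrites in place, x[:] allocates a copy, and vec(x) reshapes without copying.

```julia
A = ones(3, 4)
fill!(A, 2.0)        # in-place: overwrites A, no new allocation
flat_copy = A[:]     # colon indexing allocates a fresh copy of the data
flat_view = vec(A)   # reshape to a vector sharing A's memory, no copy
A[1, 1] = 99.0
# flat_copy is unaffected by the write to A; flat_view sees the change
```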

Andre Bieler

Feb 2, 2016, 1:03:28 PM
to julia-users

I found that reducing memory allocation in the loop does not do much in terms of speed.
E.g. when doing something like

xx = xprime[:]

the timing difference between

spl(xprime[:]) and spl(xx)

is only about 5%, so my guess is most of the time is just spent inside the spl() function call, and that would have to be optimized in order to speed things up.
(No, I did not do xx = xprime[:] inside the loop...)

Is spl() really faster than the MATLAB version of it?

cheers


Lutfullah Tomak

Feb 2, 2016, 1:48:14 PM
to julia-users
Hi,
I attached the code I used and the timings I get.
I used mnspline. If Dierckx is used, my changes will be less visible in the timings, since the interpolations take much more time.
Regards
splinetry.jl
timings.txt

pokerho...@googlemail.com

Feb 3, 2016, 5:12:36 AM
to julia-users
Thanks for the research. I also did some testing with the original code, and to me it seems the problem has nothing to do with the interpolation method but with the memory allocation. Matlab is just faster because it doesn't reallocate memory in every iteration, as you said. It also explains my timings, I guess: Matlab with 1 iteration is relatively slow, but once it's looped, it doesn't reallocate memory. So I am going to look into @inbounds and so on. I am pretty confident that spl is faster than Matlab's interp1.


nuffe

Feb 3, 2016, 1:54:49 PM
to julia-users
1) Your Matlab code is not doing exactly the same thing as the Julia code

2) The answer to your question about the difference between single-iteration timings and looped timings can be seen from these two lines:

0.000565 seconds (13 allocations: 1.603 MB)

 3.205262 seconds (118.99 k allocations: 15.651 GB, 14.08% gc time)

NoIter * 1.603 MB ~= 15.651 GB

Your Julia code allocated a new output array on every iteration. For stuff like this, Matlab's execution engine allocates an output array once and performs the rest of the computations in place.
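The preallocate-once pattern can be sketched like this (hypothetical `eval_into!` standing in for the spline evaluation; `x.^2` is a made-up stand-in computation):

```julia
# Write results into a pre-allocated buffer instead of allocating a new
# output array on every iteration of the loop.
function eval_into!(out::Vector{Float64}, x::Vector{Float64})
    @inbounds for i in eachindex(x)
        out[i] = x[i]^2   # stand-in for the actual spline evaluation
    end
    return out
end

x   = collect(0.0:0.25:1.0)
out = similar(x)              # allocated once, outside the loop
for banana in 1:3
    eval_into!(out, x)        # no allocation inside the loop
end
```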
 
3) It is not clear what either of the examples is supposed to do