laplace equation benchmark


Rajeev Singh

Jul 21, 2012, 1:48:47 AM
to juli...@googlegroups.com
Hi,

I have implemented the example discussed at http://www.scipy.org/PerformancePython/ in Julia. The performance on my machine is -


    Octave (0):  14.9932 seconds
     GNU R (0):  56.4161 seconds
     Julia (0):  39.1679 seconds
              gcc     intel
  1      :  13.4774     *     NumPy
  2      :      *     1.6633  Cilk Plus
  3,   4 :   2.4963   1.9179  Vectorized Fortran (pure)


Julia's performance seems to be significantly lower than that of Octave and NumPy. I am surprised at this result; I hope someone can explain.

Best,
Rajeev


The Julia code is -

const dx = 0.1;
const dy = 0.1;
const dx2 = dx*dx;
const dy2 = dy*dy;

const N     = 150;
const Niter = 2^15;

u = zeros(N,N);
u[1,:] = 1;

for i = 1:Niter
    u[2:N-1, 2:N-1] = ((u[1:N-2, 2:N-1] + u[3:N, 2:N-1])*dy2 + (u[2:N-1,1:N-2] + u[2:N-1, 3:N])*dx2) * (1./ (2*(dx2+dy2)));
end

Jeff Bezanson

Jul 21, 2012, 2:45:22 AM
to juli...@googlegroups.com
Hi Rajeev,

At the moment, julia has quite different performance characteristics
than python/numpy. I implemented the looping version, and in julia it
is about 8x faster than the vectorized version. Over time we will try
to improve our vectorized performance.

And, although not important for this problem, at N=10 both versions in
julia are significantly faster than numpy.

Tim Holy

Jul 21, 2012, 5:11:04 AM
to juli...@googlegroups.com
Hi Rajeev,

To amplify what Jeff said a bit more: as written, the simplest implementation
of your "vectorized" code implies the creation of many temporary objects. For
example, each reference operation u[i:j] creates a new array with a copy of
the data in the original u. All that memory allocation and copying is surely
the major bottleneck in your code.
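[Editorial aside: for contrast, NumPy's basic slices are views rather than copies, which is part of why the vectorized NumPy version escapes this particular cost. A minimal sketch, using `np.shares_memory` to make the distinction visible:]

```python
import numpy as np

u = np.zeros((150, 150))

# A basic slice in NumPy is a view: it shares memory with u,
# so no element data is copied when it appears on the right-hand side.
s = u[1:-1, 1:-1]
print(np.shares_memory(s, u))   # True

# An explicit copy behaves like Julia's u[2:N-1, 2:N-1] did at the time:
# fresh memory is allocated and the data is duplicated.
c = u[1:-1, 1:-1].copy()
print(np.shares_memory(c, u))   # False
```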

At the moment, in Julia it's often best to write your code out explicitly
using for loops, because it avoids most of the creation of temporaries. Jeff
indicated you get an 8-fold improvement this way, which makes Julia's time the
best of all the non-parallel implementations (by a large margin).

If you're used to writing vectorized code, this can seem like a step
backwards, and sometimes it is. But it's also the case that, when you're
actually required to use vectorization to achieve performance, sometimes you
have to dance through awkward hoops. So there's also much to be said for the
ability to write fast loops.

Some of the other interpreted languages may use some clever techniques to
reduce the need for creating temporaries. One effort to do that in Julia is
here:
https://github.com/kk49/julia-delayed-matrix
But I'm sure that as Julia's compiler evolves (it is still very young), these
types of features will start being incorporated transparently into the
language itself.

Best,
--Tim

Krzysztof Kamieniecki

Jul 21, 2012, 7:28:07 PM
to juli...@googlegroups.com
A challenge! :)

Let me see what I can do; I have to add sub-array/indexing functionality to demat...

Rajeev Singh

Jul 28, 2012, 4:26:46 AM
to juli...@googlegroups.com
Hi,

Sorry for the late response. My looped code is much slower than the
vectorized one. The changed code is -

#u[2:N-1, 2:N-1] = ((u[1:N-2, 2:N-1] + u[3:N, 2:N-1])*dy2 + (u[2:N-1,1:N-2] + u[2:N-1, 3:N])*dx2) * (1./ (2*(dx2+dy2)));
for j = 2:N-1
    for k = 2:N-1
        u[j,k] = ((u[j-1,k] + u[j+1,k])*dy2 + (u[j,k-1] + u[j,k+1])*dx2) * (1./ (2*(dx2+dy2)));
    end
end

Rajeev

Tim Holy

Jul 28, 2012, 12:12:51 PM
to juli...@googlegroups.com
On Saturday, July 28, 2012 01:56:46 PM Rajeev Singh wrote:
> Sorry for the late response. My looped code is much slower than the
> vectorized one.

It's because doing things in the global workspace is slow. Put it inside a
function, and it will be fast. Inside a function the compiler can do robust
type analysis and prove that types won't change; that's much harder for
global variables.

This issue bites everyone sooner or later.
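[Editorial aside: the same trap has an analogue in Python, where a loop over a module-level global is slower than the identical loop over a local, for a different reason: namespace dictionary lookups rather than lost type information. An illustrative sketch with a made-up workload size; exact timings vary by machine:]

```python
import time

N = 10**6
g = 0.0

def bump_global():
    # Reads and writes the module-level global on every iteration;
    # each access goes through the module namespace.
    global g
    for _ in range(N):
        g += 1.0

def bump_local():
    # The same arithmetic on a function-local variable.
    x = 0.0
    for _ in range(N):
        x += 1.0
    return x

t0 = time.perf_counter(); bump_global(); t1 = time.perf_counter()
t2 = time.perf_counter(); result = bump_local(); t3 = time.perf_counter()
print(f"global: {t1 - t0:.3f}s  local: {t3 - t2:.3f}s")
```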

--Tim

Rajeev Singh

Jul 29, 2012, 8:28:08 AM
to juli...@googlegroups.com
That didn't help; it's actually worse now. The new benchmark results are -

                         Julia (0):  44.6231 seconds
       Matlab Coder + Compiler (0):   7.1128 seconds
              gcc     intel
  1      :  13.3564     *     NumPy 
  2      :      *     1.6148  Cilk Plus 
  3,   4 :   2.4686   1.9959  Vectorized Fortran (pure) 
  5,   6 :   2.9018   1.8919  Vectorized C (pure) 
  7,   8 :   2.8206   1.7980  Vectorized CPP (pure) 

Any more suggestions?

Rajeev

Tim Holy

Jul 29, 2012, 1:37:37 PM
to juli...@googlegroups.com
Can you post your whole Julia code? Something needs fixing (you should never
see numbers like that), but it's very difficult to figure out what it is without
knowing exactly what you're doing.

--Tim

Rajeev Singh

Jul 29, 2012, 2:06:04 PM
to juli...@googlegroups.com
Here is the full code -


function main(Niter)
    const dx = 0.1;
    const dy = 0.1;
    const dx2 = dx*dx;
    const dy2 = dy*dy;

    const N     = 150;
#    const Niter = 2^10;

    u = zeros(N,N);
    u[1,:] = 1;
    u1 = u;

    for i = 1:Niter
#        u[2:N-1, 2:N-1] = ((u[1:N-2, 2:N-1] + u[3:N, 2:N-1])*dy2 + (u[2:N-1,1:N-2] + u[2:N-1, 3:N])*dx2) * (1./ (2*(dx2+dy2)));
        for j = 2:N-1
            for k = 2:N-1
                u1[j,k] = ((u[j-1,k] + u[j+1,k])*dy2 + (u[j,k-1] + u[j,k+1])*dx2) * (1./ (2*(dx2+dy2)));
            end
        end
        u = u1;
    end
    return u[1,1]
end

Niter = int( ARGS[1] )
a = main(Niter)



Tim Holy

Jul 29, 2012, 3:41:20 PM
to juli...@googlegroups.com
I'm not sure that I understand the purpose of everything you're doing in that
code, but does this do what you want?

File "laplace.jl":
function laplace(u::Matrix, dx, dy)
    const dx2::eltype(u) = dx^2
    const dy2::eltype(u) = dy^2
    uout = zeros(eltype(u), size(u))
    for i = 2:size(u,1)-1
        for j = 2:size(u,2)-1
            uout[i,j] = ((u[i-1,j]+u[i+1,j])*dx2 + (u[i,j-1]+u[i,j+1])*dy2)/(2*(dx2+dy2))
        end
    end
    return uout
end

function laplace_iter(u::Matrix, dx, dy, Niter::Int)
    for i = 1:Niter
        u = laplace(u, dx, dy)
    end
end

N = 150
u = zeros(N,N)
u[1,:] = 1
Niter = 2^10
dx = 0.1
dy = 0.1


Interactive:

julia> load("laplace.jl")

julia> @time laplace_iter(u, dx, dy, Niter)
elapsed time: 0.3265519142150879 seconds

julia> @time laplace_iter(u, dx, dy, Niter)
elapsed time: 0.30145692825317383 seconds

julia> @time laplace_iter(u, dx, dy, Niter)
elapsed time: 0.297670841217041 seconds

Note the first time is slower because the code has to be compiled.

Rajeev Singh

Jul 30, 2012, 2:34:19 AM
to juli...@googlegroups.com
Hi,

Still no improvement. Here are the benchmark results with your code -

                         Julia (0):  52.2765 seconds
       Matlab Coder + Compiler (0):   6.9941 seconds
              gcc     intel
  1      :  13.8478     *     NumPy 
  2      :      *     1.6003  Cilk Plus 
  3,   4 :   2.4711   1.8651  Vectorized Fortran (pure) 
  5,   6 :   2.9666   1.7616  Vectorized C (pure) 
  7,   8 :   2.8220   1.7653  Vectorized CPP (pure) 

Let me mention that the results are obtained by running the programs through a terminal command, so they include the startup time (if any), which I think is about 2 seconds for both Matlab and Julia.

Could my Julia version be at fault? I am using Commit 488b0f923a (2012-06-17 15:48:26).

Rajeev




Jeff Bezanson

Jul 30, 2012, 3:24:55 AM
to juli...@googlegroups.com
With the latest version I get times similar to Tim's; about 0.3 sec.
With the version you mention I get about 0.6 sec. Why would it differ
by a factor of 100? This is getting a bit silly.

Rajeev Singh

Jul 30, 2012, 3:55:37 AM
to juli...@googlegroups.com
All the benchmark programs are running for 2^16 iterations, not 2^10. I am not claiming that 0.6 sec is 50 sec on my machine; I am comparing NumPy to Julia for the same number of iterations.

Rajeev


Tim Holy

Jul 30, 2012, 9:58:58 AM
to juli...@googlegroups.com
On Monday, July 30, 2012 12:04:19 PM Rajeev Singh wrote:
> Let me mention that the results are obtained by running the programs
> through a terminal command and it includes the startup time (if any) which
> I think is about 2 seconds for both matlab and julia.

That seems suboptimal, although unlikely to account for the difference. I did
show how to time the Julia code.

> Could my Julia version be at fault? I am using Commit 488b0f923a
> (2012-06-17 15:48:26).

Array performance has changed a lot since then, although I'd be a bit
surprised if it were significant for 2d.

On my machine, with N = 150 and Niter = 2^16, I get 18 seconds. That's close
enough to your NumPy result that any differences between machines might swamp
it, so I am skeptical about making that particular comparison.

I don't doubt we have a ways to go to catch up to the multi-threaded
implementations, but you could try a DArray-based Julia implementation.
Interesting that the Matlab coder one is so good. Is that multithreaded, too?

--Tim

Rajeev Singh

Jul 30, 2012, 11:39:42 AM
to juli...@googlegroups.com
On Mon, Jul 30, 2012 at 7:28 PM, Tim Holy <tim....@gmail.com> wrote:
> On Monday, July 30, 2012 12:04:19 PM Rajeev Singh wrote:
>> Let me mention that the results are obtained by running the programs
>> through a terminal command and it includes the startup time (if any) which
>> I think is about 2 seconds for both matlab and julia.
>
> That seems suboptimal, although unlikely to account for the difference. I did
> show how to time the Julia code.

Running directly from the terminal is a requirement for me, as the big
computers we have are managed via a PBS system.

>
>> Could my Julia version be at fault? I am using Commit 488b0f923a
>> (2012-06-17 15:48:26).
>
> Array performance has changed a lot since then, although I'd be a bit
> surprised if it were significant for 2d.

A git-related question here, as I am new to using git: do I need to
clone again to get the latest Julia, or is there another way?

>
> On my machine, with N = 150 and Niter = 2^16, I get 18 seconds. That's close
> enough to your NumPy result that any differences between machines might swamp
> it, so I am skeptical about making that particular comparison.
>
> I don't doubt we have a ways to go to catch up to the multi-threaded
> implementations, but you could try a DArray-based Julia implementation.
> Interesting that the Matlab coder one is so good. Is that multithreaded, too?

I made sure to check that all the programs are running with single thread.

Rajeev

Stefan Karpinski

Jul 30, 2012, 11:48:58 AM
to juli...@googlegroups.com
On Mon, Jul 30, 2012 at 11:39 AM, Rajeev Singh <rajs...@gmail.com> wrote:

A git-related question here, as I am new to using git: do I need to
clone again to get the latest Julia, or is there another way?

This usually works:

git pull && make testall

Occasionally you need to do "make clean" and then "make testall" to get things working, depending on how much messing around with the internals has been going on.

Tim Holy

Jul 31, 2012, 4:09:51 AM
to juli...@googlegroups.com
On Monday, July 30, 2012 09:09:42 PM Rajeev Singh wrote:
> I made sure to check that all the programs are running with single thread.

I wrote out your algorithm using direct BLAS calls. That got it down to ~8s on
my machine, from an improved version using plain Julia that was running at
~12s. There is a bigger advantage to using BLAS (I've seen up to 6-fold) when
the matrices get bigger. So I'm very glad I took the time to learn this, I'll
definitely use it in some work I'll be doing soon.

Files are attached. Change Niter to 2^16 before running @time laplace_iter(u,
dx, dy, Niter).

If we assume that BLAS is basically as good as it gets, and if my laptop is
not dreadfully slower than your machine, then there's something slightly
suspicious about your results that show running times <3s.

For the Cilk and various Fortran implementations, if you really are certain
they are restricted to using only one core (you checked via top, right?),
then...are you really sure the computation is running as you expect? The
reason I ask is that there were things in your Julia code that one could
imagine would cause it to just skip the entire computation: you were returning
u[1,1], and that value is always zero, so in principle a sufficiently-smart
compiler could have just optimized the entire computation away. In my code I
changed it to make sure that didn't happen. If it is getting simply optimized
away for some of your languages (e.g., Cilk & Fortran), that would explain the
suspiciously-fast results you're seeing. Of course, if it were being optimized
away, perhaps it should have run even faster, so I am not sure this explains
it.

If all really is as you expect, then either:
1. My machine is a lot slower than yours. Try this code, and see what happens.
2. There is something significantly better than OpenBLAS out there. It would be
interesting to know what it is.

Those seem to be the only two possibilities. I profiled the laplace_blas code,
and it's spending all of its time in BLAS calls. So by this point there is no
"julia" in there, really.

--Tim
laplace.jl
laplace_blas.jl
blas.jl

Rajeev Singh

Jul 31, 2012, 6:29:12 AM
to juli...@googlegroups.com
Only in the Julia program do I return u[1,1]; in all the other programs
the entire matrix u is written to a file.
The other programs can be found at http://dl.dropbox.com/u/5939736/laplace.zip
It's a bit messy, but I hope one would be able to run them (or at least
the C, C++, Fortran and Cilk versions).

Rajeev

Aron Ahmadia

Jul 31, 2012, 7:21:34 AM
to juli...@googlegroups.com
If we assume that BLAS is basically as good as it gets, and if my laptop is
not dreadfully slower than your machine, then there's something slightly
suspicious about your results that show running times <3s.

That's a very bad assumption in this case.  The overhead of the BLAS routine calls tends to make them less efficient than properly hand-vectorized C and Fortran routines for vector-vector and matrix-vector (level 1 and level 2) operations if you know what you are doing.  Also, the BLAS routines are not well-mapped to streaming kernels like Jacobi/Gauss-Seidel.

As a second note, the only way to properly make these comparisons is to run Rajeev's C/Fortran benchmark code on the same machine that is running Julia.  Given that he's just performing Gauss-Seidel sweeps over the matrix, it's not hard to come up with the percentage of peak flop/s and bandwidth you are doing on your machine.
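[Editorial aside: a rough sketch of that estimate. The flop and byte counts follow from the five-point update; the peak figures are placeholders, not measurements of any machine in this thread, and a 150x150 float64 grid (~180 KB) largely fits in cache, so the bandwidth bound is pessimistic:]

```python
# Back-of-the-envelope bounds for the N=150, Niter=2^16 benchmark.
N, Niter = 150, 2**16
interior = (N - 2) ** 2            # grid points updated per sweep
flops_per_point = 7                # ~4 adds, 2 multiplies, 1 divide
bytes_per_point = 5 * 8            # 4 neighbor reads + 1 write, float64

total_flops = Niter * interior * flops_per_point
total_bytes = Niter * interior * bytes_per_point

peak_flops = 10e9                  # assumed: ~10 GFLOP/s on one core
peak_bw = 10e9                     # assumed: ~10 GB/s memory bandwidth

print(f"compute-bound lower bound:   {total_flops / peak_flops:.2f} s")
print(f"bandwidth-bound lower bound: {total_bytes / peak_bw:.2f} s")
```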

I glanced at the various pieces of code fragments that have been posted, and I'm still new to Julia, but are you trying to do Jacobi or Gauss-Seidel sweeps?  Some of the later Julia code is definitely applying Jacobi iterations, but the original Python example is Gauss-Seidel.  

A

Tim Holy

Jul 31, 2012, 7:34:44 AM
to juli...@googlegroups.com
On Tuesday, July 31, 2012 03:59:12 PM Rajeev Singh wrote:
> Only in the julia program I return u[1,1]. In all other programs the
> entire matrix u is written to a file.
> Other programs can be found at http://dl.dropbox.com/u/5939736/laplace.zip
> Its a bit messy but I hope one would be able to run them (or at least
> the C, C++, Fortran and Cilk).

I briefly tried building Cilk, but the build failed. But since the whole point
of Cilk is parallel computing, which means multithreaded, at this point I have
a hard time believing that all of these are single-threaded.

You could write the Julia version using a DArray if you wanted to make that
comparison.

--Tim

Rajeev Singh

Jul 31, 2012, 7:54:09 AM
to juli...@googlegroups.com
Most of the programs do perform Gauss-Seidel sweep and I compare only
those. That is the reason I have a temporary array to hold the result
of each iteration when loops are used. Gauss-Seidel is the one I want
to check because the Jacobi iterations cannot be vectorized.

Regarding Cilk

$ icc -fast cilk_laplace.c
works for me giving the following timing -

$ echo 640000 | time ./a.out
15.39user 0.00system 0:15.46elapsed 99%CPU

It runs long enough for me to see that a single thread is running.
(Note that 640000 is ~2^16 * 10, hence 15 sec is consistent with my
earlier 1.6 sec.)

Rajeev

Aron Ahmadia

Jul 31, 2012, 7:57:59 AM
to juli...@googlegroups.com
Most of the programs do perform Gauss-Seidel sweep and I compare only
those. That is the reason I have a temporary array to hold the result
of each iteration when loops are used. Gauss-Seidel is the one I want
to check because the Jacobi iterations cannot be vectorized.

Are you confusing the two?  Gauss-Seidel is frequently performed in-place, whereas Jacobi sweeps require a copy.  Both are vectorizable, but it is trickier to do so with Gauss-Seidel.
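[Editorial aside: the distinction can be made concrete with a small NumPy sketch, not taken from any of the benchmark codes in this thread; dx2 and dy2 are as in the original example:]

```python
import numpy as np

def jacobi_sweep(u, dx2, dy2):
    # Jacobi: every update reads only *old* values, so the new grid
    # must be written into a separate (copied) array.
    new = u.copy()
    new[1:-1, 1:-1] = ((u[:-2, 1:-1] + u[2:, 1:-1]) * dy2 +
                       (u[1:-1, :-2] + u[1:-1, 2:]) * dx2) / (2 * (dx2 + dy2))
    return new

def gauss_seidel_sweep(u, dx2, dy2):
    # Gauss-Seidel: updates in place, so each point sees the already
    # updated values of earlier neighbors -- no copy, but the loop
    # carries a dependency that makes naive vectorization incorrect.
    n, m = u.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            u[i, j] = ((u[i-1, j] + u[i+1, j]) * dy2 +
                       (u[i, j-1] + u[i, j+1]) * dx2) / (2 * (dx2 + dy2))
    return u

grid = np.zeros((5, 5))
grid[0, :] = 1.0
ja = jacobi_sweep(grid.copy(), 0.01, 0.01)
gs = gauss_seidel_sweep(grid.copy(), 0.01, 0.01)
print(np.allclose(ja, gs))   # False: the two sweeps propagate differently
```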

A

Rajeev Singh

Jul 31, 2012, 10:58:36 AM
to juli...@googlegroups.com
I probably mixed up the two :) I meant Jacobi, which is trivially vectorizable.


Viral Shah

Aug 17, 2012, 5:45:41 AM
to juli...@googlegroups.com
Filed an issue to track the laplace benchmark performance:

https://github.com/JuliaLang/julia/issues/1169

-viral

Stefan Karpinski

Aug 17, 2012, 12:42:08 PM
to juli...@googlegroups.com
wrong link: https://github.com/JuliaLang/julia/issues/1168


Viral Shah

Aug 20, 2012, 5:45:44 AM
to juli...@googlegroups.com
Hi Rajeev,

I implemented and cleaned up the vectorized and devectorized versions of the laplace benchmark in test/perf2. The devectorized version looks like this; it does exactly what the vectorized version does (but about 7 times faster) and runs in 3.8 seconds in Julia:

function laplace_iter_devec(u, dx2, dy2, Niter, N)
    uout = copy(u)
    for iter = 1:Niter
        for i = 2:N-1
            for j = 2:N-1 
                uout[i,j] = ( (u[i-1,j]+u[i+1,j])*dy2 + (u[i,j-1]+u[i,j+1])*dx2 ) * (1./(2*(dx2+dy2)))
            end 
        end
        u, uout = uout, u
    end
    return u
end 

function laplace_devec()
    N = 150
    u = zeros(N, N) 
    u[1,:] = 1 
    Niter = 2^15
    dx2 = dy2 = 0.1*0.1
    u = laplace_iter_devec(u, dx2, dy2, Niter, N)
end






Rajeev Singh

Aug 20, 2012, 6:28:34 AM
to juli...@googlegroups.com
Hi,

The latest implementation is indeed much better. Following is the benchmark result now -

                         Julia (0):  10.8544 seconds
       Matlab Coder + Compiler (0):   7.1042 seconds
                         Clang (0):   2.4669 seconds
              gcc     intel
  1      :  13.6140     *     NumPy 
  2      :      *     1.6257  Cilk Plus 
  3,   4 :   2.5153   1.8934  Vectorized Fortran (pure) 
  5,   6 :   2.9180   1.7829  Vectorized C (pure) 
  7,   8 :   2.8382   1.7744  Vectorized CPP (pure) 

Julia is doing better than NumPy now. By the way, Matlab's startup time is very high. The time taken by Matlab Coder is only about 2.7 secs for the same parameters, which is almost as good as gcc gets. Excluding the startup time, Julia takes about 9.2 secs, which is comparable to Matlab without Coder.

Rajeev



Viral Shah

Aug 20, 2012, 6:31:45 AM
to juli...@googlegroups.com
Rajeev,

Can you post all the codes to a gist or email them to me? That would save me the hassle of rewriting them for the benchmark.

-viral

Rajeev Singh

Aug 20, 2012, 6:40:47 AM
to juli...@googlegroups.com
The codes are here - http://dl.dropbox.com/u/5939736/laplace.zip

Rajeev

Patrick O'Leary

Aug 20, 2012, 8:30:22 AM
to juli...@googlegroups.com
On Monday, August 20, 2012 5:28:34 AM UTC-5, Rajeev wrote:
Time taken by matlab coder is only about 2.7 secs for same parameters which is almost as good as gcc gets.

I should hope so; MATLAB's C code generation is very good these days, so the only real difference should be the runtime startup and the marshalling via the MEX interface to C.