Help: My "parallel" code 8300x slower than the serial.


Daniel Carrera

Jun 17, 2015, 9:37:28 AM6/17/15
to julia...@googlegroups.com
Hi everyone,

My adventures with parallel programming with Julia continue. Here is a different issue from other threads: My parallel function is 8300x slower than my serial function even though I am running on 4 processes on a multi-core machine.

julia> nprocs()
4

I have Julia 0.3.8. Here is my program in its entirety (not very long).

function main()

    nbig::Int16 = 7
    nbod::Int16 = nbig
    bod = Float64[
        0       1      2      3      4      5      6  # x position
        0       0      0      0      0      0      0  # y position
        0       0      0      0      0      0      0  # z position
        0       0      0      0      0      0      0  # x velocity
        0       0      0      0      0      0      0  # y velocity
        0       0      0      0      0      0      0  # z velocity
        1       1      1      1      1      1      1  # Mass
    ]

    a = zeros(3,nbod)
    
    @time for k = 1:1000
        gravity_1!(bod, nbig, nbod, a)
    end
    println(a[1,:])

    @time for k = 1:1000
        gravity_2!(bod, nbig, nbod, a)
    end
    println(a[1,:])
end

function gravity_1!(bod, nbig, nbod, a)
    
    for i = 1:nbod
        a[1,i] = 0.0
        a[2,i] = 0.0
        a[3,i] = 0.0
    end
    
    @inbounds for i = 1:nbig
        for j = (i + 1):nbod
            
            dx = bod[1,j] - bod[1,i]
            dy = bod[2,j] - bod[2,i]
            dz = bod[3,j] - bod[3,i]
            
            s_1 = 1.0 / sqrt(dx*dx+dy*dy+dz*dz)
            s_3 = s_1 * s_1 * s_1
            
            tmp1 = s_3 * bod[7,i]
            tmp2 = s_3 * bod[7,j]
            
            a[1,j] = a[1,j] - tmp1*dx
            a[2,j] = a[2,j] - tmp1*dy
            a[3,j] = a[3,j] - tmp1*dz
            
            a[1,i] = a[1,i] + tmp2*dx
            a[2,i] = a[2,i] + tmp2*dy
            a[3,i] = a[3,i] + tmp2*dz
        end
    end
    return a
end

function gravity_2!(bod, nbig, nbod, a)
    
    for i = 1:nbod
        a[1,i] = 0.0
        a[2,i] = 0.0
        a[3,i] = 0.0
    end
    
    @inbounds @sync @parallel for i = 1:nbig
        for j = (i + 1):nbod
            
            dx = bod[1,j] - bod[1,i]
            dy = bod[2,j] - bod[2,i]
            dz = bod[3,j] - bod[3,i]
            
            s_1 = 1.0 / sqrt(dx*dx+dy*dy+dz*dz)
            s_3 = s_1 * s_1 * s_1
            
            tmp1 = s_3 * bod[7,i]
            tmp2 = s_3 * bod[7,j]
            
            a[1,j] = a[1,j] - tmp1*dx
            a[2,j] = a[2,j] - tmp1*dy
            a[3,j] = a[3,j] - tmp1*dz
            
            a[1,i] = a[1,i] + tmp2*dx
            a[2,i] = a[2,i] + tmp2*dy
            a[3,i] = a[3,i] + tmp2*dz
        end
    end
    return a
end



So this is a straightforward N-body gravity calculation. Yes, I realize that gravity_2!() is wrong (the parallel writes race), but that's fine. Right now I'm just talking about the CPU time. When I run this on my computer I get:

julia> main()
elapsed time: 0.000475294 seconds (0 bytes allocated)
[1.4913888888888889 0.4636111111111111 0.1736111111111111 -5.551115123125783e-17 -0.17361111111111116 -0.4636111111111112 -1.4913888888888889]
elapsed time: 3.953546654 seconds (126156320 bytes allocated, 13.49% gc time)
[0.0 0.0 0.0 0.0 0.0 0.0 0.0]


So, the serial version takes 0.000475 seconds and the parallel version takes 3.95 seconds. Furthermore, the parallel version triggers the garbage collector. I suspect the problem has something to do with memory access: maybe the parallel code is wasting a lot of time copying variables between processes. Whatever the reason, this is bad. The documentation says that @parallel is supposed to be fast even for very small loops, but that's not what I'm seeing. And a non-buggy implementation would presumably be even slower.

Have I missed something? Is there an obvious error in how I'm using the parallel constructs?

I would appreciate any guidance you may offer.

Cheers,
Daniel.

Tim Holy

Jun 17, 2015, 11:02:18 AM6/17/15
to julia...@googlegroups.com
You're copying a lot of data between processes. Check out SharedArrays. But I
still fear that if each "job" is tiny, you may not get as much benefit without
further restructuring.

I trust that your "real" workload will take more than 1ms. Otherwise, it's
very unlikely that your experiments in parallel programming will end up saving
you time :-).
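
The SharedArrays idea could be sketched roughly like this (untested, Julia 0.3-era syntax; assumes the 7 x nbod `bod` matrix from the original post):

```julia
# Allocate the arrays in shared memory so that every worker sees the
# same underlying buffer instead of receiving its own copy of the data.
bod_sh = SharedArray(Float64, 7, nbod)
a_sh   = SharedArray(Float64, 3, nbod)
bod_sh[:] = bod[:]

@sync @parallel for i = 1:nbig
    for j = (i + 1):nbod
        # same force kernel as gravity_1!, but reading from bod_sh
        # and accumulating into a_sh
    end
end
```

Note that the writes to a_sh[:,j] still race between workers, so a correct version also needs per-worker accumulation buffers and a final reduction.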

--Tim

Daniel Carrera

Jun 17, 2015, 4:38:15 PM6/17/15
to julia...@googlegroups.com
Sadly, this function is pretty close to the real workload. I do n-body simulations of planetary systems. In this problem, the value of "n" is small, but the simulation runs for a very long time. So this nested for loop will be called maybe 10 billion times in the course of one simulation, and that's where all the CPU time goes.

I already have simulation software that works well enough for this. I just wanted to experiment with Julia to see if this could be made parallel. An irritating problem with all the codes that solve planetary systems is that they are all serial -- this problem is apparently hard to parallelize.


Cheers,
Daniel.


--
When an engineer says that something can't be done, it's a code phrase that means it's not fun to do.

Miguel Bazdresch

Jun 17, 2015, 5:19:03 PM6/17/15
to julia...@googlegroups.com
Can you arrange the problem so that you send each CPU a few seconds of work? The overhead would become negligible.
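
One hypothetical way to get there in Julia (a sketch only; `run_chunk` stands in for whatever coarse unit of work fits your problem, e.g. one full simulation out of a parameter survey, and is not a real function):

```julia
# Coarse-grained parallelism: each remote call carries seconds of work,
# so the fixed per-call communication overhead is amortized away.
# run_chunk(c) is assumed to integrate one independent sub-problem
# for many timesteps and return its result.
results = pmap(run_chunk, chunks)
```

This doesn't help the inner force loop itself, but for anything embarrassingly parallel at a coarser level it sidesteps the overhead entirely.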

-- mb

Angel de Vicente

Jun 17, 2015, 6:07:20 PM6/17/15
to julia...@googlegroups.com
Hi,

Daniel Carrera <dcar...@gmail.com> writes:
> I already have simulation software that works well enough for this. I
> just wanted to experiment with Julia to see if this could be made
> parallel. An irritating problem with all the codes that solve
> planetary systems is that they are all serial -- this problem is
> apparently hard to parallelize.

I have not had much luck getting Julia to perform efficiently in parallel, but (and this is a bit off-topic) parallelizing simple N-body codes is quite easy. I have a demo code for this in Fortran. The serial part does basically the same as your Julia code (plus it also updates the velocities and positions of all the bodies):

DO t = 0.0, t_end, dt
   v = v + a * dt/2
   r = r + v * dt

   a = 0.0
   DO i = 1,n
      DO j = i+1,n
         rji = r(j,:) - r(i,:)
         r2 = SUM(rji**2)
         r3 = r2 * SQRT(r2)
         a(i,:) = a(i,:) + m(j) * rji / r3
         a(j,:) = a(j,:) - m(i) * rji / r3
      END DO
   END DO

   v = v + a * dt/2

   t_out = t_out + dt
   IF (t_out >= dt_out) THEN
      DO i = 1,n
         PRINT*, r(i,:)
      END DO
      t_out = 0.0
   END IF

END DO

The parallel version (with MPI) of this toy code is quite efficient. For 1000 bodies and 10k iterations, the serial code on my workstation, measured with 'time', takes ~138 seconds. The parallel version (running on 4 processors) takes only ~33 seconds (showing, by the way, that super-linear speedup, though unusual, is possible :-)).

[angelv@comer ~/NBODY]$ time ./nbody_serial < stars_sphere.txt > stars_serial.out
137.414u 0.079s 2:17.83 99.7%  0+0k 0+9056io 0pf+0w

[angelv@comer ~/NBODY]$ time mpirun -np 4 ./nbody_parallel < stars_sphere.txt > stars_parallel.out
110.891u 0.954s 0:32.80 340.9%  0+0k 15128+8992io 64pf+0w

Since this doesn't involve Julia at all, if you want further details or
the code itself, perhaps we can talk off-list to avoid non-Julia noise
in the list.

Cheers,
--
Ángel de Vicente
http://www.iac.es/galeria/angelv/

Stefan Karpinski

Jun 18, 2015, 12:49:42 PM6/18/15
to Julia Users
On Wed, Jun 17, 2015 at 4:38 PM, Daniel Carrera <dcar...@gmail.com> wrote:
An irritating problem with all the codes that solve planetary systems is that they are all serial -- this problem is apparently hard to parallelize.

That appears to be a language-independent issue. Even with mature threading, it doesn't seem like this problem would scale well. It might be possible to figure out a clever way to parallelize it, but that is a serious algorithmic research problem. A superficial Google Scholar search indicates that some work has been done in this direction using tree data structures.

Daniel Carrera

Jun 18, 2015, 2:42:02 PM6/18/15
to julia...@googlegroups.com
Yeah, this is an algorithmic issue and many people much smarter than me are working on it. I was just hoping that using shared memory would give me a small amount of parallelism without clever algorithms.

Tree codes are the standard solution for systems like galaxies, where the number of particles is very large, but the dynamical age is small. In a galaxy you have 200 billion particles that have each done 40 orbits.

For planetary systems the problem is reversed. We have few particles and long ages (Earth has gone around the Sun 4 billion times), so small errors can really accumulate. For these problems the state of the art is either very high-order integrators or "symplectic" integrators, which are designed to preserve a Hamiltonian closely related to the Hamiltonian of the system.

Unfortunately, unlike the tree codes, none of these solutions is parallel. Because every particle affects every other particle and you can't afford to throw away any information, the problem is very resistant to parallelization. I was wondering if I could get a free lunch by using shared memory.

Stefan Karpinski

Jun 18, 2015, 2:52:56 PM6/18/15
to Julia Users
Unfortunately, I suspect that even with threads and fully shared process memory this would not parallelize well.

Scott Jones

Jun 19, 2015, 10:20:33 AM6/19/15
to julia...@googlegroups.com
Not being an astronomer, I don't know how this is supposed to be, but you have calculations based on the bod array inside the loop, but you are never updating bod...  Are you just depending on the compiler to move those out of the loop?

Daniel Carrera

Jun 19, 2015, 10:47:20 AM6/19/15
to julia...@googlegroups.com
On 19 June 2015 at 16:20, Scott Jones <scott.pa...@gmail.com> wrote:
Not being an astronomer, I don't know how this is supposed to be, but you have calculations based on the bod array inside the loop, but you are never updating bod...  Are you just depending on the compiler to move those out of the loop?


Oh, the program is not complete. In a real program the "bod" array would be updated elsewhere. Updating bod requires O(n) additions, multiplications, and memory look-ups, while computing the forces on each particle requires O(n^2) additions, multiplications, divisions, and memory look-ups. All attempts at optimization therefore focus on the O(n^2) part, which is why that's the only part I posted.

Cheers,
Daniel

Scott Jones

Jun 19, 2015, 11:53:05 AM6/19/15
to julia...@googlegroups.com
Is your 'n' only the planets, large moons, dwarf planets, asteroids, and the Sun? Or does it also include the age, which you mentioned was in the billions of orbits? (Sorry for my dumb questions, but the parallel part of this is rather interesting to me!)

Daniel Carrera

Jun 19, 2015, 12:21:43 PM6/19/15
to julia...@googlegroups.com
Sorry, I should have been clearer. My "n" is the number of massive bodies. For my simulations, that means the Sun and between 4 and 8 planets. The problem is O(n^2) because the gravity of every massive particle affects every other particle. With the Sun and 4 planets you have 5*4 = 20 force calculations, but between 1000 stars (e.g. a star cluster) you have 1000 * 999 = 999,000 force calculations.
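
For concreteness, the pair counting above (every body pulling on every other) is just:

```julia
# Ordered force evaluations between n mutually attracting bodies
npairs(n) = n * (n - 1)
npairs(5)     # -> 20      (Sun + 4 planets)
npairs(1000)  # -> 999000  (a 1000-star cluster)
```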

There are roughly two common ways to deal with this problem:

1) One way is to use an approximate algorithm that doesn't compute every single force exactly. For example, a tree code is O(n log n) instead of O(n^2). The result is less accurate, but it scales better for large "n". This is what you'd use to model a galaxy. That may sound strange, but in terms of dynamics the galaxy is very young: in its entire history, the Milky Way has only gone around about 40 times.

2) The other way is to keep the O(n^2) function but try to call it less often. You do this for problems where a less accurate result is not acceptable, as in planetary systems. Because there are so many orbits, the errors really accumulate: with an error of just 0.00001% per orbit, the result would be completely useless after only a million years. Planet formation takes about 100 million years, and Earth is 4,500 million years old.

Cheers,
Daniel.

Jiahao Chen

Jun 22, 2015, 7:48:29 PM6/22/15
to julia...@googlegroups.com
The difficulties inherent in parallelizing n-body problems are well known; see the related thread from several months ago about Lennard-Jones molecular dynamics simulations. There is not much you can do other than use fast multipole methods (tree codes) and multi-time-step integrators. In fact, n-body simulations were one of the main reasons tree codes were invented in the first place.

Thanks,

Jiahao Chen
Research Scientist
MIT CSAIL