Cloud in Cell interpolator performance

53 views
Skip to first unread message

Kirill Streltsov

unread,
Jun 23, 2013, 6:24:14 AM6/23/13
to open...@googlegroups.com
While I was fixing AspectJ I have run the profiling code a few times. It
turns out that the interpolation to grid is by far the slowest part of
the simulation.

This particularly striking with the Cloud in Cell (CIC) interpolator.
One has to increase the number of iterations a bit (200 should do) to
see it. But we are talking about figures like 67% of the computation
time that is used by interpolation to the grid!

Interpolation to the particles uses about the same percentage as
particle push (< 20%). Note that I was using Boris Solver not the simple
Euler. Field calculation on the other hand was less than 1%.

But why is the interpolation to grid so expensive? Its a very simple
algorithm...I imagined it to suffer when one uses multiple threads and
hight particle densities because the add current methods are
synchronized. But this is happening in the single threaded version too!

We have 8 current writes in total in the CIC interpolator, if one
removes 4 the figure goes down to 47%. Interestingly this is about the
same percentage as with the ChargeConservingCIC which is far more
complex but also uses only four writes in the usual case.

My question is: why are writes so expensive in this situation? We have
four writes in every particle solver algorithm too but they are not that
slow. And why is this happening in the single threaded version? As far
as I understand the fact that the methods are synchronized should not
make any difference here.

Jan Kis

unread,
Jun 24, 2013, 2:08:53 PM6/24/13
to open...@googlegroups.com
My question is: why are writes so expensive in this situation? We have four writes in every particle solver algorithm too but they are not that slow. And why is this happening in the single threaded version? As far as I understand the fact that the methods are synchronized should not make any difference here.

Synchronization always costs something, even in single threaded applications as locks have to be acquired. To see the overhead of the synchronization you can compare the absolute times with and without the synchronized keyword in a single threaded version. 




--
You received this message because you are subscribed to the Google Groups "OpenPixi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openpixi+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



Kirill Streltsov

unread,
Jun 24, 2013, 7:15:39 PM6/24/13
to open...@googlegroups.com
Wow I wasnt expecting that.

I just did some test runs...with 10^4 steps and 10^4 particles it takes 26s for synchronized single threaded, 17s synchronized four threads and 13s for non-synchronized single thread. But my laptop is not that fast and it actually just has 2 cores with hyperthreading.

Considering this maybe it would be better to keep the particles in the cells? S.t. each cell has its particle list and when a particle crosses a boundary its send to the other cell? Or do the distributed version all the way and split the simulation area among the threads?
To unsubscribe from this group and stop receiving emails from it, send an email to openpixi+u...@googlegroups.com.

Andreas Ipp

unread,
Jun 25, 2013, 12:57:47 PM6/25/13
to open...@googlegroups.com
Can you give us the exact parameters you used to perform your test runs? Then we can test it on other machines as well.

(Ideally, it would be nice if also the parallelization could be specified in the xml file :) )

One could try a comparison test run using Jan's parallelization routines to split the simulation area into four pieces for the four threads and compare the timing there.

Jan Kis

unread,
Jun 25, 2013, 1:50:41 PM6/25/13
to open...@googlegroups.com
The interesting questions are:

How many threads did you use?
How large was the grid?

Kirill Streltsov

unread,
Jun 25, 2013, 2:44:20 PM6/25/13
to open...@googlegroups.com
I have used the defaults. That is 100x100 simulation with 10x10 cells.
The only thing that I have changed was the number of steps and the
number of particles. Both parameters were set to 10^4.
I have used 4 threads because I have a Core i7 dual core processor with
hyperthreading. I have tried 2 threads some time ago and I recall it
being a bit slower.

I guess the performance depends on the amount of cells (more is better)
and threads and maybe even on how the particle list is ordered. And of
course on the cpu itself. But its not obvious what parameters would be
best and whether they correspond with physically reasonable parameters.

I guess the distributed version would need to be changed a bit to make
the test. Here is a proposal that is based on Jan's distributed model
but simplified because we dont need all the distributed funtionality here:

We determine the number of threads to be t. The grid runs through its
constructor but adds a new method "createOverlayGrid". This method would
create t OverlayGrid classes that each hold their own cell lists with
the propper offsets in the index() method. These lists merely reference
the original list that is in the real grid, BUT also add a special class
of overlay cells to the boundaries of each overlay grid. These cells
would be very lightweigt. They only need to contain the Jx and Jy variables.

In the ParticleGridInitializer (or even in the settings class) the
particle list is sorted by the particle position and split into t lists.
Then proper boundary regions are set up in the particle boundaries. If a
particle crosses a "thread boundary" it gets copied over to the other
list. This of course would also have to be done when a particle crosses
a periodic boundary.

There would need to be two types of particle iterators. One that ignores
the four different lists and does what it does now with a single list
and a special one for the interpolateToGrid() method. This one takes
each list and assigns it to a specific thread.

The interpolation iterpolates to the overlay grid. After the
interpolation is complete the values in the overlay cells get added to
the ones in the real grid.

The rest of the simulation remains the same. One could even do load
balancing by shifting the boundaries dynamically. (though the list sorts
might be expensive) And since this is just a simplified version of the
distributed version most of the code is already there. :-)

PS: If we keep the current implementation we could introduce a single
addJ method with two arguments theredy reducing the number of method
calls by a factor of two in the CIC and by one call in the Charge
Conserving CIC.

Andreas Ipp

unread,
Jun 26, 2013, 10:25:10 AM6/26/13
to open...@googlegroups.com
@Kirill, this sounds like some major effort. What would be the minimum code change required?

@Jan, is it possible to use your framework with threads instead of using the IBIS framework? How much code change would be expected?

Andreas Ipp

unread,
Jun 26, 2013, 10:44:35 AM6/26/13
to open...@googlegroups.com
I just wanted to test the parallel version [1] and noticed it is broken because the class Particle was moved recently to a different package.

Kirill, will you have time to fix this in the near future?

[1] https://github.com/openpixi/openpixi/wiki/Getting-Started-with-Distributed-Version

Jan Kis

unread,
Jun 26, 2013, 1:41:55 PM6/26/13
to open...@googlegroups.com
I do not think that is a good idea. It would be so complex that it will only be slower. Instead, I would focus on the root cause and that is the fact that the interpolation to grid probably tries to access the same cells from different threads. That means that ideally we would like to parallelize the interpolation per grid.

However, before diving into implementation, I suggest to find out where exactly the bottleneck is (which step of the simulation or even which method - is it the addJx method?). I also suggest to make measurements on different machines. E.g. VSC2 which has more powerful machines.  


--

Kirill Streltsov

unread,
Jun 26, 2013, 5:07:24 PM6/26/13
to open...@googlegroups.com
@broken distributed: I don't really have time for that now...I don't
know how to get the distributed part of the project to get recognized in
eclipse, nor do I have IBIS up and running. Andreas, if you already have
IBIS installed, try changing

import ...physics.Particle;
to
import ...physics.particles.Particle;

Since there is no particle creation in the distributed version that I
know of, this should fix it.

@causes: the bottleneck(s) are definitely the addJx and addJy methods.
If I remove two of the addJx and two of the addJy methods in the CIC
interpolator (making it similar to the charge conserving version) the
time consumed (for the interpolation step) drops from ~64% to ~48%.
I can not tell whether its the lock acquisition that is the issue or
whether the threads are blocking each other. The latter certainly
depends on the particle density and how the iterator loops through the
particle list.
But I think that the previously stated data (26s synchronized SINGLE
thread, 13s non synchronized SINGLE thread) points to the conclusion
that the problem is the acquisition of a lock.

As I proposed before, one could reduce the lock acquisitions by
combining addJx and addJy to one method addJ. But that doesn't really
fix the problem.

One could also just remove the parallelization of this step (and the
synchronized keywords with it).

A parallelization through the cell iterator would still require some
looping through the particles. One could somehow split the algorithms in
a step where the nearest cell is determined (this is done now anyway)
and then use the cell iterator where each cell loops through the
particles and picks out those that belong to it. But thats actually very
complicated because we write to multiple cells. And its algorithm
dependent. It would introduce complexity in the algorithm itself where
we do not want it.
Actually this issue is discussed here
http://dx.doi.org/10.1016/j.jpdc.2008.05.009 (start from section 4).
They end up using a cell based iteration but with a (very) sophisticated
particle list sorting algorithm.
@ALEXANDRU, if you are reading this, this might be interesting for you
because they are doing it on the GPU!

Just using the distributed version seems like overkill to me and IBIS is
an annoying dependency.

I do not see the problem with my proposal though. The only complexity
that it introduces is during the initialization. In terms of overhead
its also just the copying of particles. But unlike in the distributed
version we would not actually copy the particles but merely the
references which should be rather fast, right?
Sure, there are load balancing issues just like in the distributed
version. At worst we are back to single threaded performance but we will
not be slower.
In terms of implementation it shouldn't be too difficult because all the
code is already there. The boundary routines are very general already
and wouldn't need to be changed. We also have the methods to create a
grid with special boundary cells. There might be some subtleties that I
just don't see atm. But there wont be anything major because all the
additional stuff we create only affects the interpolation to grid step
everything else would remain as it is now.

But I think its no point arguing because nobody has time to implement
it, right?

So the most practical approach would be to disable the parallelization
for this particular method for now. And if Alexandru finds the time he
can implement the stuff mentioned in the paper. The only thing I don't
quite get yet is whether that sorting is interpolation-algorithm
dependent or not.

PS: Wow there are a lot of papers on PIC parallelization, especially on
GPU and interpolation stuff (check google scholar).
> <mailto:openpixi%2Bunsu...@googlegroups.com>.

Andreas Ipp

unread,
Jun 27, 2013, 6:52:14 AM6/27/13
to open...@googlegroups.com
I fixed the distributed version: 
https://github.com/openpixi/openpixi/pull/62

Below you find the output I get from the parallel version using IBIS with default parameters on my local computer. So indeed, one should look more closely into the "interpolate to grid" routine.

4 parallel nodes:

  2.639s 100.00% ..... Simulation time
  0.300s 11.37% ..... Push particles time
  1.917s 72.63% ..... Interpolate to grid time
  0.028s  1.04% ..... Solve fields time
  0.389s 14.75% ..... Interpolate to particle time
  0.139s  5.26% ..... Arriving particles waiting time
  0.544s 20.63% ..... Ghost particles waiting time
  0.043s  1.63% ..... Ghost cells waiting time


2 parallel nodes:

  2.607s 100.00% ..... Simulation time
  0.434s 16.63% ..... Push particles time
  1.712s 65.68% ..... Interpolate to grid time
  0.026s  0.98% ..... Solve fields time
  0.432s 16.56% ..... Interpolate to particle time
  0.025s  0.94% ..... Arriving particles waiting time
  0.075s  2.90% ..... Ghost particles waiting time
  0.010s  0.40% ..... Ghost cells waiting time


1 parallel node:

  2.742s 100.00% ..... Simulation time
  0.491s 17.91% ..... Push particles time
  1.727s 62.99% ..... Interpolate to grid time
  0.023s  0.86% ..... Solve fields time
  0.497s 18.13% ..... Interpolate to particle time


Reply all
Reply to author
Forward
0 new messages