@broken distributed: I don't really have time for that now... I don't
know how to get the distributed part of the project recognized in
Eclipse, nor do I have IBIS up and running. Andreas, if you already have
IBIS installed, try changing
Since there is no particle creation in the distributed version that I
know of, this should fix it.
@causes: the bottleneck(s) are definitely the addJx and addJy methods.
If I remove two of the addJx and two of the addJy methods in the CIC
interpolator (making it similar to the charge conserving version) the
time consumed (for the interpolation step) drops from ~64% to ~48%.
I cannot tell whether it's the lock acquisition that is the issue or
whether the threads are blocking each other. The latter certainly
depends on the particle density and how the iterator loops through the
But I think that the previously stated data (26s synchronized SINGLE
thread, 13s non-synchronized SINGLE thread) points to the conclusion
that the problem is the acquisition of a lock, since a single thread
has nobody to block it.
As I proposed before, one could reduce the lock acquisitions by
combining addJx and addJy into one method addJ. But that doesn't really
fix the problem.
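
To make the addJ idea concrete, here is a minimal sketch. CurrentGrid, its fields and the getters are hypothetical stand-ins, not the actual grid class of the project; the only point is that one deposit takes the monitor once instead of twice.

```java
// Hypothetical stand-in for the grid; all names here are assumptions.
class CurrentGrid {
    private final double[][] jx;
    private final double[][] jy;

    CurrentGrid(int nx, int ny) {
        jx = new double[nx][ny];
        jy = new double[nx][ny];
    }

    // Current situation: two separate monitor acquisitions per deposit.
    synchronized void addJx(int x, int y, double value) { jx[x][y] += value; }
    synchronized void addJy(int x, int y, double value) { jy[x][y] += value; }

    // Combined variant: both components are written under one acquisition.
    synchronized void addJ(int x, int y, double valueX, double valueY) {
        jx[x][y] += valueX;
        jy[x][y] += valueY;
    }

    synchronized double getJx(int x, int y) { return jx[x][y]; }
    synchronized double getJy(int x, int y) { return jy[x][y]; }
}
```

This halves the number of acquisitions per deposit but every deposit still goes through the same lock, which is why it does not really fix the problem.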
One could also just remove the parallelization of this step (and the
synchronized keywords with it).
A parallelization through the cell iterator would still require some
looping through the particles. One could somehow split the algorithm
into a step where the nearest cell is determined (this is done now
anyway) and a step that uses the cell iterator, where each cell loops
through the particles and picks out those that belong to it. But that's
actually very complicated because we write to multiple cells. And it's
algorithm-dependent. It would introduce complexity in the algorithm
itself where we do not want it.
Actually this issue is discussed here
(start from section 4).
They end up using a cell based iteration but with a (very) sophisticated
particle list sorting algorithm.
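
Just to illustrate what "sorting the particle list by cell" means in its simplest form, here is a counting-sort sketch. This is NOT the algorithm from the paper (theirs is far more sophisticated); CellSort, sortByCell and the flat cell index are assumptions made up for this example.

```java
import java.util.Arrays;

// Hypothetical helper; names and the flat cell index are assumptions.
class CellSort {
    /**
     * cellOfParticle[p] is the index of the nearest cell of particle p.
     * cellStart must have length numCells + 1 and is filled so that the
     * particles of cell c are order[cellStart[c]] .. order[cellStart[c + 1] - 1].
     * Returns the particle indices grouped by cell.
     */
    static int[] sortByCell(int[] cellOfParticle, int numCells, int[] cellStart) {
        Arrays.fill(cellStart, 0);
        // Count the particles per cell, shifted by one slot.
        for (int cell : cellOfParticle) {
            cellStart[cell + 1]++;
        }
        // A prefix sum turns the counts into start offsets.
        for (int c = 0; c < numCells; c++) {
            cellStart[c + 1] += cellStart[c];
        }
        // Scatter the particle indices into their cell's range.
        int[] next = Arrays.copyOf(cellStart, numCells);
        int[] order = new int[cellOfParticle.length];
        for (int p = 0; p < cellOfParticle.length; p++) {
            order[next[cellOfParticle[p]]++] = p;
        }
        return order;
    }
}
```

With such an ordering each cell can walk only its own particles instead of picking them out of the full list. It does not by itself solve the problem that one particle writes to several cells.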
@ALEXANDRU, if you are reading this, this might be interesting for you
because they are doing it on the GPU!
Just using the distributed version seems like overkill to me and IBIS is
an annoying dependency.
I do not see the problem with my proposal though. The only complexity
that it introduces is during the initialization. In terms of overhead
it's also just the copying of particles. But unlike in the distributed
version we would not actually copy the particles but merely the
references which should be rather fast, right?
Sure, there are load balancing issues just like in the distributed
version. At worst we are back to single-threaded performance but we will
not be slower.
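
A rough sketch of the "copy only the references" part, assuming the simulation area is cut into vertical strips of equal width, one per thread. Particle, ParticlePartitioner and partitionByStrip are hypothetical names for this example, not existing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal particle; the real class has more fields.
class Particle {
    double x;
    double y;

    Particle(double x, double y) {
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper that hands each strip a list of particle references.
class ParticlePartitioner {
    static List<List<Particle>> partitionByStrip(List<Particle> particles,
                                                 double width, int strips) {
        List<List<Particle>> result = new ArrayList<>();
        for (int i = 0; i < strips; i++) {
            result.add(new ArrayList<>());
        }
        double stripWidth = width / strips;
        for (Particle p : particles) {
            int strip = (int) (p.x / stripWidth);
            // Clamp so that particles on or outside the edges land in a valid strip.
            strip = Math.max(0, Math.min(strips - 1, strip));
            // Only the reference is stored, the particle object is not copied.
            result.get(strip).add(p);
        }
        return result;
    }
}
```

If all particles end up in one strip, one thread does all the work, which is the load balancing issue mentioned above.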
In terms of implementation it shouldn't be too difficult because all the
code is already there. The boundary routines are very general already
and wouldn't need to be changed. We also have the methods to create a
grid with special boundary cells. There might be some subtleties that I
just don't see atm. But there won't be anything major because all the
additional stuff we create only affects the interpolation to grid step;
everything else would remain as it is now.
But I think there is no point arguing because nobody has time to implement it.
So the most practical approach would be to disable the parallelization
for this particular method for now. And if Alexandru finds the time he
can implement the stuff mentioned in the paper. The only thing I don't
quite get yet is whether that sorting is interpolation-algorithm
dependent or not.
PS: Wow there are a lot of papers on PIC parallelization, especially on
GPU and interpolation stuff (check Google Scholar).