Parallelization strategy


Alexandru Petcu

Jun 24, 2013, 3:33:58 PM
to open...@googlegroups.com
Hello, after a talk with Andreas we decided on the following strategy for the parallelization of the project:
   The initial plan was to take each routine (that is worth parallelizing) one at a time and write an OpenCL kernel for it, so that whenever
the routine is called it runs on the GPU. This proved to be extremely inefficient due to the overhead of the data
transfer between the GPU and the host device. This behavior can be observed in the small demo I wrote for my GSoC
application (*). As you can see, at each step the list of particles is transferred to the GPU, where the Boris calculations are done,
and then the particles are transferred back to the host, so for N iterations we would have 2*N data transfers.
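
To make the problem concrete, the naive per-routine scheme looks roughly like this on the host side (plain C against the OpenCL host API; my demo does the same through a Java binding, and all names here are made up for the example):

#include <CL/cl.h>

/* Naive scheme: copy the particles to the device, run the Boris kernel,
   copy them back -- every single step, so 2*N transfers for N iterations.
   Queue, kernel and buffers are assumed to be created elsewhere. */
void run_naive(cl_command_queue queue, cl_kernel boris_kernel,
               cl_mem d_particles, float *h_particles,
               size_t bytes, size_t n_particles, int steps)
{
    clSetKernelArg(boris_kernel, 0, sizeof(cl_mem), &d_particles);
    for (int i = 0; i < steps; i++) {
        /* host -> device */
        clEnqueueWriteBuffer(queue, d_particles, CL_TRUE, 0, bytes,
                             h_particles, 0, NULL, NULL);
        /* one work-item per particle */
        clEnqueueNDRangeKernel(queue, boris_kernel, 1, NULL,
                               &n_particles, NULL, 0, NULL, NULL);
        /* device -> host */
        clEnqueueReadBuffer(queue, d_particles, CL_TRUE, 0, bytes,
                            h_particles, 0, NULL, NULL);
    }
}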
   Because of this, we decided that it would be best to keep the list of particles on the GPU and perform the required calculations
in a single kernel. In practice I will write a kernel that runs an entire simulation scenario. For this I will have to identify what different
types of scenarios could exist (one possible shape of such a kernel is sketched below). For example:
- one scenario would contain the ChargeConservingCIC interpolator algorithm, PeriodicBoundary, the Boris solver, etc.
- another scenario: the ChargeConservingCIC interpolator algorithm, PeriodicBoundary, the BorisDamped solver, etc.
- and so on.
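
A very rough OpenCL C sketch of what I mean by a scenario kernel (the particle layout and the simplified push are only placeholders; the real kernel would mirror the Java algorithms):

/* Particle data stays in device memory between iterations. */
typedef struct {
    float x, y, prevX, prevY;
    float vx, vy;
    float charge, mass;
} Particle;

/* One scenario = one kernel: each work-item advances one particle through
   all stages of a time step. The push is deliberately simplified (plain
   electric-field kick, no Boris rotation) and the boundary is periodic. */
__kernel void scenario_periodic_boris(__global Particle *particles,
                                      __global const float *ex,
                                      __global const float *ey,
                                      const float dt,
                                      const float width,
                                      const float height,
                                      const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    Particle p = particles[i];

    /* store the previous position (needed by the charge conserving CIC) */
    p.prevX = p.x;
    p.prevY = p.y;

    /* simplified solver step */
    p.vx += p.charge / p.mass * ex[i] * dt;
    p.vy += p.charge / p.mass * ey[i] * dt;
    p.x  += p.vx * dt;
    p.y  += p.vy * dt;

    /* periodic boundary */
    if (p.x < 0.0f)    { p.x += width;  p.prevX += width;  }
    if (p.x >= width)  { p.x -= width;  p.prevX -= width;  }
    if (p.y < 0.0f)    { p.y += height; p.prevY += height; }
    if (p.y >= height) { p.y -= height; p.prevY -= height; }

    particles[i] = p;
}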

The challenge with this approach is identifying all the steps that a particle goes through in order to complete one step of the simulation (... --> interpolation --> ... --> Boris routine --> ...).
It would be helpful if someone who knows the project very well could list these steps, ideally with code references.

(*)https://github.com/PetcuAlexandru/OpenCL_Demo/blob/master/OpenCL_Demo/src/main/java/org/openpixi/pixi/physics/solver/BorisCL.java

Kirill Streltsov

Jun 24, 2013, 6:55:03 PM
to open...@googlegroups.com
Hi Alexandru!

Sounds nice! Did I understand you correctly that you want to run the whole Simulation.step() method on the GPU and hold the particle list as well as the grid in the VRAM?

Here is a quick review of the steps a particle takes during one iteration:

1) ParticleMover.push()
    *) This method gets the particle list, takes care of the parallelization and executes the Push ParticleAction on each particle. This action is defined in the private Push class that is also in the ParticleMover.
    *) Push.execute(): first the particle position is stored in the prevX and prevY variables of the particle. This is needed for the charge conserving CIC interpolation algorithm. Then Solver.step() is called, where the particle position is advanced by one time step. Then the boundary check is performed.
    *) In the ParticleBoundaries.applyOnParticleCenter() method, which is implemented in the SimpleParticleBoundaries class, first the region (left, right, etc.) is determined and then the appropriate ParticleBoundary is called. That means that boundaries can be different on different sides of the simulation.
    *) In the periodic case the particle boundary simply adds the appropriate value to the x, y, prevX and prevY variables of the particle.
    *) The hardwall boundary first calls the Solver.complete() method, then performs the reflection and then calls the Solver.prepare() method. In the solver methods the velocity is shifted by half a time step.
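
In OpenCL C the two boundary cases could become small device-side functions, roughly like this (only a sketch with made-up names; the hardwall version leaves out the Solver.complete()/prepare() half-step velocity shift described above):

/* Periodic: shift the coordinate and its previous value by the simulation size. */
void apply_periodic(float *x, float *prevX, float size)
{
    if (*x < 0.0f)  { *x += size; *prevX += size; }
    if (*x >= size) { *x -= size; *prevX -= size; }
}

/* Hardwall: reflect the position and flip the velocity component.
   The real implementation additionally brackets this with Solver.complete()
   and Solver.prepare(), which shift the velocity by half a time step. */
void apply_hardwall(float *x, float *v, float size)
{
    if (*x < 0.0f)  { *x = -*x;              *v = -*v; }
    if (*x >= size) { *x = 2.0f * size - *x; *v = -*v; }
}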

Note: in the distributed version there can also be internode boundaries. This is documented on Jan's blog: http://karolovbrat-gsoc2012.blogspot.sk/ I guess one would split the work among multiple GPUs in a similar way.

2) Interpolator.interpolateToGrid()
    *) This method is similar to ParticleMover.push(): it handles the parallelization and then calls InterpolatorAlgorithm.interpolateToGrid() on each particle. This call dispatches either to the Cloud In Cell algorithm or to the Charge Conserving CIC. They interpolate the velocity of the particle to the grid according to its position relative to the grid cells. This is the slowest part of the simulation at the moment, because the setters on the grid cells need to be synchronized methods.
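
On the GPU those synchronized setters would probably have to become atomic updates. OpenCL 1.x has no native float atomics, so the usual workaround is a compare-and-swap loop. A sketch (the grid layout and names are my guesses, and the deposited quantity is just a generic one):

/* Atomic float add built from atomic_cmpxchg -- the common OpenCL 1.x
   workaround for concurrent writes to the same grid cell. */
void atomic_add_float(volatile __global float *addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                            old_val.u, new_val.u) != old_val.u);
}

/* Cloud-in-cell style deposition of a particle quantity onto the four
   surrounding grid points; one work-item per particle. */
__kernel void interpolate_to_grid(__global const float *x,
                                  __global const float *y,
                                  __global const float *q,
                                  __global float *grid,
                                  const float cellW, const float cellH,
                                  const int gridW, const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    float fx = x[i] / cellW;
    float fy = y[i] / cellH;
    int cx = (int)fx, cy = (int)fy;
    float wx = fx - cx, wy = fy - cy;   /* weights within the cell */

    atomic_add_float(&grid[cy * gridW + cx],           q[i] * (1.0f - wx) * (1.0f - wy));
    atomic_add_float(&grid[cy * gridW + cx + 1],       q[i] * wx          * (1.0f - wy));
    atomic_add_float(&grid[(cy + 1) * gridW + cx],     q[i] * (1.0f - wx) * wy);
    atomic_add_float(&grid[(cy + 1) * gridW + cx + 1], q[i] * wx          * wy);
}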

3) Interpolator.interpolateToParticles()
    *) After the fields on the grid have been updated they are interpolated back to the particles. This works similarly to the above, but the interpolateToParticle() method is only implemented in the Cloud In Cell algorithm; the Charge Conserving CIC uses the same implementation because it is actually just an extension of the Cloud In Cell algorithm.
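
This direction is the easy one for the GPU: each work-item only reads from the grid and writes to its own particle, so no synchronization is needed. A sketch along the same lines as above (names and layout assumed):

/* Bilinear gather of a grid field at the particle position; pure reads
   from the grid, so every work-item proceeds independently. */
__kernel void interpolate_to_particles(__global const float *x,
                                       __global const float *y,
                                       __global float *fieldAtParticle,
                                       __global const float *field,
                                       const float cellW, const float cellH,
                                       const int gridW, const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    float fx = x[i] / cellW;
    float fy = y[i] / cellH;
    int cx = (int)fx, cy = (int)fy;
    float wx = fx - cx, wy = fy - cy;

    fieldAtParticle[i] =
          field[cy * gridW + cx]             * (1.0f - wx) * (1.0f - wy)
        + field[cy * gridW + cx + 1]         * wx          * (1.0f - wy)
        + field[(cy + 1) * gridW + cx]       * (1.0f - wx) * wy
        + field[(cy + 1) * gridW + cx + 1]   * wx          * wy;
}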

That's it. This is where most of the simulation time is spent. The field update is very quick and is done in the SimpleSolver algorithm.

In principle we can use any combination of algorithms. In practice we will probably stick to SimpleSolver, ChargeConservingCIC and Boris. At the moment the parallelization is hidden away in the iterator classes: the physicists only need to implement the algorithm that should be performed on a single particle or grid cell. Do you think this could be preserved with OpenCL? Then you might not need to deal with the inner workings of the algorithms...

The problem with the algorithms is that they store a lot of data in the particle that is not needed all the time. The Charge Conserving CIC, for example, stores the prevX and prevY variables. One could perform the push and the interpolation to the grid in one step and have push store the data for the interpolator only temporarily. That way we would only need to store 2*(number of threads) doubles for a short time instead of 2*(number of particles) all the time. Maybe this temp data would even fit in some cache and not need to be transferred... :-)
The same argument applies to the prevForce variables that are stored by the solver for the complete() method.

The problem is letting push know what kind of data it should store, because this depends on the interpolation algorithm and the solver algorithm. I am just mentioning it in case memory and bandwidth become a problem. A lot of optimization can be done here! One just needs to find a good way to store some temporary data in the right places.
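
In OpenCL terms the fused version could look roughly like this: push and deposition in one kernel, so prevX/prevY only ever live in the work-item's private registers and never in global memory (again just a sketch; the push and the deposit are trimmed down, and atomic_add_float is the helper from the sketch above):

/* Declared here, defined in the interpolation sketch above. */
void atomic_add_float(volatile __global float *addr, float val);

__kernel void push_and_deposit(__global float *x, __global float *y,
                               __global float *vx, __global float *vy,
                               __global const float *ex,
                               __global const float *ey,
                               __global float *grid,
                               const float qOverM, const float q,
                               const float dt, const float cellW,
                               const float cellH, const int gridW,
                               const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    float prevX = x[i];   /* temporary, never written back to global memory */
    float prevY = y[i];

    /* simplified push */
    vx[i] += qOverM * ex[i] * dt;
    vy[i] += qOverM * ey[i] * dt;
    x[i]  += vx[i] * dt;
    y[i]  += vy[i] * dt;

    /* a charge conserving scheme would use prevX/prevY together with the
       new position here; shown only as a single midpoint deposit to keep
       the sketch short */
    int cx = (int)((0.5f * (prevX + x[i])) / cellW);
    int cy = (int)((0.5f * (prevY + y[i])) / cellH);
    atomic_add_float(&grid[cy * gridW + cx], q);
}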

Cheers
Kirill

Alexandru Petcu

Jun 25, 2013, 3:30:13 PM
to open...@googlegroups.com
Hello, thank you for providing the steps of the simulation.


>Sounds nice! Did I understand you correctly that you want to run the whole Simulation.step() method
>on the GPU and hold the particle list as well as the grid in the VRAM?

Yes, that's right. As I explained in my first post, this will eliminate
the overhead of the data transfer between the GPU and the host device and will hopefully result in good
performance for the parallel version.


>The physicists need only to implement the algorithm that should be performed on a single particle or grid cell. Do you think this could be preserved with OpenCL?
>Then you might not need to deal with the inner workings of the algorithms.

My plan is to run the simulation on the entire list of particles in order to get a general
idea about the performance that can be obtained with the parallel version.
I think that once this is done it shouldn't be difficult to adapt the OpenCL kernel
to work on a single particle. Keep in mind that OpenCL parallelization means
writing kernels (they are similar to normal functions and written in C) that do the same thing
as the Java code, except that they run on the GPU. This means that they are
somewhat independent of the Java code, so there is no problem in adding one or more kernels
depending on one's needs. This is also another reason for my choice of
parallelization strategy: when someone wants to add a new routine to the simulation and
include it in the parallel version, all they have to do is add it
to an existing kernel, or write a new one based on the existing ones, without needing much knowledge of OpenCL, if any at all.
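
For example (purely illustrative, with made-up names): if someone later wanted a damped solver in the parallel version, it would just be one more kernel next to the existing ones, and the Java side would only have to select it:

/* Hypothetical damped variant of the simplified push used above. Adding it
   means adding this kernel to the existing OpenCL source, nothing more. */
__kernel void push_damped(__global float *x, __global float *vx,
                          __global const float *ex,
                          const float qOverM, const float damping,
                          const float dt, const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    vx[i] += (qOverM * ex[i] - damping * vx[i]) * dt;
    x[i]  += vx[i] * dt;
}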

Thank you,
Alexandru Petcu

