On the first week or two I familiarized myself with OpenCL(*) and the OpenCL binding for Java, JavaCL(**).The initial plan was to browse the project and find routines that are worth
parallelizing and try to run this routines on the GPU by writing an OpenCl kernel for every one of them, which means that at anytime during a simulation, one of this routines would be called it would be moved on the GPU and runned on multiple threads in parallel.This proved to be very inefficient because when such a routine was called, in order to run on the OpenCL device, the data used by this routine should also be transferred on the device with the appropriate conversions.That means that at every step of the simulation there would be a set of data transfers between the host device and the OpenCL device plus the necessary conversions which added a considerable time overhead. In order to avoid this I decided to write an entire simulation in a single kernel. The disadvantage of this approach is that for every simulation scenario there should be a separate kernel. A simulation scenario consists of a set of routines that are called during that simulation. One example of a scenario would be a simulation that among others it uses the Charge Conserving CIC interpolation algorithm, another scenario would use the Cloud In Cell algorithm.
To do this I first identified what are the steps of a simulation and added them to the OpenCL kernel. Trying to run the simulation I obtained incorrect results and more than that the results were different for every separate run. This indicated to me that there were some synchronization problems between the threads. When one thread was in the first step of the simulation maybe another thread was already at the third step, for example, so what I needed was that every thread that finishes a step waits for the other threads to finish the same step before moving on to the next step. My first idea was to use a barrier at the end of each step. Unfortunately OpenCL provides barriers that work only inside a workgroup, while I needed a global one for all workgroups. I tried to create one myself but it didn’t work and also affected the performance of the program. The solution was simple when I found out that OpenCL keeps all global memory on the device until it’s manually released which means I can write different kernels that work with the same data which is kept on the device. Knowing that JavaCL provides an event based mechanism for inter-kernel synchronization I split the simulation in smaller kernels and associated every one of them an event. I chained this kernels so when one finishes it signals the kernel that was waiting for this event which does the same thing until the last kernel finishes its execution. This approach also solves the problem described above (for every scenario there should be a different kernel) because this way I can write all the kernels I need and for every scenario I just select the ones I want.
I will now give a description of the code I wrote, the problems I encountered and the solutions I found. As I said before the first thing I did was to identify the steps of the simulation and write the corresponding kernels. The parallelization consists of having a different thread for each particle, so in each thread the simulation algorithms are applied to that particle.
The first step of the simulation is the “particle push” step. First I convert the list of particles so I transfer it on the device. OpenCL kernels are written in a limited version of C language, C99 standard so that means I cannot transfer Java objects to the kernel. I had to convert the list of particles in an array of doubles and put every attribute from the Particle class in that array. For a 100 particles with 22 attributes that means the array will have 2200 elements, with the first 22 corresponding to the first particle and so on. I did this for all the other objects I transferred on the device(force, cells etc.). Unfortunately there is no other way to do this and this approach makes it easy to make mistakes and write the wrong index when you want to access an element from the array. I lost a lot of time correcting this type of errors.
In this kernel I applied the Boris.
The second step and the most difficult is the interpolation(Charge Conserving CIC). The biggest problem I had here was caused by the way cells were created in the Java code. Some of the cells that were created had the same reference. In Java this means that any modification done on one object will be visible on all the objects that share its reference. Since on the kernel the cells are held in an array I cannot put more cells at the same address. Initially I tried to determine what cells have the same reference after the interpolation, so I could update them accordingly, but doing this meant performing the same operation at every iteration. Instead I decided to do this in the host code and just pass it to the kernel. I created something like a map for the cells that share the same reference. All the cells with the same reference will have the same mark so for example if cells[x][y] = cells[a][b] => mark[x][y] = mark[a][b]. When I modify a cell on the kernel I modify all the other cells that have the same mark.
The third step is the “grid update” step. Same as before I had to be careful to update all the cells with the same mark.
The final step “interpolate to particle” gave me no mentionable problems.
You can find my fork of the project here(***).
I also started writing a C version of the project which I plan to parallelize using OpenMP because it’s a thread based parallelization which means there won’t be any data transfers as memory is shared between the threads of the same process. For now I implemented the required data types and helper functions and I wrote the “push” step. The project can be found here(****).
I want to mention that I did all this with help of my mentor, Andreas.