Parallelization - problem analysis


Jan Kis

May 27, 2012, 7:33:20 PM
to open...@googlegroups.com
Hi Guys,

this weekend I read a couple of papers on parallelization and then tried to tidy up the different thoughts we mentioned in our last Skype conference. You can view the result here:

Feel free to comment on anything you disagree with either here or directly at my blog.

Cheers,
Jan 

Andreas Ipp

May 28, 2012, 7:39:47 AM
to open...@googlegroups.com
Hi Jan,

Thanks for sharing your thoughts. Indeed, grid decomposition and particle decomposition can be both useful approaches. Maybe even a mixture of the two could be used for the best performance (for example use GD across several CPUs, and PD within a single CPU).

For the problem analysis, I would also add:
* Coding difficulty:
  * PD: Straightforward. Basically already done during the GSoC application period by your demo app.
  * GD: Hard. Requires a lot of code changes.

Here are some requirements that I could think of:
* It would be good if the code structure would allow the physicist to implement the physics equations without needing to know whether GD or PD or a mixture of both is used. Also the implementation of more advanced algorithms should be easy.
* The implementation should be flexible enough to allow experimenting with different size GDs on any number of CPUs (starting from 1), with or without optional PD.
* Dynamical load balancing would be great, but should again be fully transparent to the simulation. The physics simulation should not need to care about how it is parallelized.

Given that you already solved PD in your demo app and given that GD is a hard problem, I would suggest starting with GD early and seeing how far we can get.

Regarding dynamical load balancing:
* This depends on the problem one wants to study. This will not be an issue if particles are mostly randomly distributed like in a free gas of particles and there are only small deviations that one wants to study (e.g. plasma waves). In extreme cases (e.g. large electric field that pulls all particles to one side of the simulation) this could be a bigger issue.
* Therefore this is something one could add later if necessary, but if it is not a big problem, one could prepare the code structure so that it would not be difficult to implement it later.

Another possibility I want to throw in is not only dynamical load balancing, but dynamical grid size. There could be one region with fewer particles that needs less resolution (e.g. 8 x 8 grid) while a neighboring region has many particles and could use higher resolution (e.g. 32 x 32 grid for same physical size). Connecting the two regions could be as "simple" as devising the right boundary conditions that would interpolate from the 8 rows from one side to 32 rows on the other side, and back. So the 8x8 region doesn't know what happens on the other side. All it sees is the boundary grid that behaves like just another 8x8 neighbor, while the boundary transparently interpolates to a 32x32 neighboring grid.
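
To make the interpolating boundary a bit more concrete, here is a minimal sketch of the refinement direction (all class and variable names are made up for illustration; the averaging direction back from 32 to 8 cells is omitted):

    // Hypothetical sketch: the 32-cell (fine) grid asks its boundary for ghost values;
    // the boundary answers by linearly interpolating the 8-cell edge row of the coarse
    // neighbor. The reverse direction would average 4 fine cells into 1 coarse ghost cell.
    public class InterpolatingBoundary {

        private final double[] coarseEdge = new double[8];  // copy of the coarse neighbor's edge row
        private final double[] fineGhost = new double[32];  // ghost row as seen by the fine grid

        /** Refine 8 coarse values to 32 ghost values by linear interpolation. */
        public void updateFineGhostRow() {
            double ratio = (double) coarseEdge.length / fineGhost.length;  // 0.25
            for (int i = 0; i < fineGhost.length; i++) {
                double x = (i + 0.5) * ratio - 0.5;  // position in coarse-cell coordinates
                int j = Math.max(0, Math.min(coarseEdge.length - 2, (int) Math.floor(x)));
                double t = Math.max(0.0, Math.min(1.0, x - j));
                fineGhost[i] = (1 - t) * coarseEdge[j] + t * coarseEdge[j + 1];
            }
        }
    }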

As mentioned, on a single CPU, GD could have a significant advantage if particle collision is activated. Here boundary conditions could transparently write to the neighboring grid directly, so that no additional memory copy is necessary (ghost cells would just write directly to cells in neighboring grids instead of to a temporary buffer).

* Grids could be connected in 2 or 3 dimensions (either 4 or 6 neighbors).

* If grids of different sizes are considered, the number of neighbors could vary even more: a large 32 x 32 grid of low activity could be connected to two 16 x 16 grids on the left side with high activity. The boundary condition would connect the upper half of the 32 ghost row with one 16x16 grid, and the lower half of the 32 ghost row with the other 16x16 grid.

* In each direction the boundary could be some physical boundary (hardwall, thermal, ..) or some "topological" boundary (periodic boundary, neighboring boundary, boundary to a different CPU, ...). Grids and boundaries could be moved for dynamical load balancing.

So it is clear that a very flexible design for boundaries and grid connections would allow a lot of possibilities and extensions. But it is probably also difficult to come up with a good and simple design.

But it seems to me, the GD problem really lies at the heart of a good PIC parallelization. Once a very good and flexible basic structure is in place, everything else could be built on top of it.

Would you agree with this approach?

Thinking about all these different possibilities (all of which we won't implement right away, but just to keep in mind possible extensions), I'm thinking of the following strategy:

* A grid is always attached to 4 boundaries (or 6 boundaries). If we take corners into account, it is 8 boundaries (in 2 dim) and 26 boundaries in 3 dimensions. Boundaries can be anything. Grids themselves don't know how they are connected to other grids, or whether they are connected to other grids at all. All they see is a fixed number of boundaries.

* Connection is only done from one boundary to another boundary. It is not that grids are attached to each other; rather, grids are attached to boundaries, and boundaries are attached to each other. How complex the link is depends on the implementation of the boundary, and this can easily be extended by introducing new kinds of boundaries. A boundary may write directly into the array of a neighboring grid (neighbor or periodic boundary condition), or it may send the information to another CPU, or it may process the information prior to sending it to another CPU (interpolate to a different grid size), or split the information (send information to 2 different CPUs), or do something physical (hardwall boundary, thermal boundary), or anything else one could think of in the future. (A rough sketch of such a boundary interface follows after this list.)

* On top of it, the whole structure should be lightweight so that a grid decomposition into many small logical grids is possible without much performance penalty (e.g. by writing directly to neighboring cells where possible).
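
To check whether this layout could work in code, here is a rough sketch of what the boundary abstraction might look like (only one field component is shown, and all names are placeholders rather than existing pixi classes):

    // Placeholder sketch: a grid only ever talks to its boundaries. Whether a boundary
    // writes directly into another grid, buffers data for another CPU, interpolates to
    // a different resolution, or implements a hardwall is hidden behind this interface.
    interface Boundary {
        /** Called by the owning grid when it writes outside its own index range. */
        void setBz(int x, int y, double value);
        /** Called once per time step to flush anything the boundary has buffered. */
        void exchange();
    }

    /** Shared-memory neighbor: write straight into the other grid, no temporary copy. */
    class NeighborBoundary implements Boundary {
        private final SimpleGrid neighbor;
        NeighborBoundary(SimpleGrid neighbor) { this.neighbor = neighbor; }
        public void setBz(int x, int y, double value) {
            // translate ghost coordinates into the neighbor's interior coordinates
            neighbor.setBz(Math.floorMod(x, neighbor.size()), Math.floorMod(y, neighbor.size()), value);
        }
        public void exchange() { /* nothing to do, writes were direct */ }
    }

    /** Minimal stand-in for the real grid class, just enough for the sketch. */
    class SimpleGrid {
        private final double[][] bz;
        SimpleGrid(int size) { bz = new double[size][size]; }
        int size() { return bz.length; }
        void setBz(int x, int y, double value) { bz[x][y] = value; }
    }

A boundary that sends data to another CPU or a hardwall boundary that reflects fields would implement the same interface, so the grid never needs to know which kind it is attached to.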

Does this sound like a reasonable layout?

Cheers,
Andreas

Andreas Ipp

May 28, 2012, 11:49:56 AM
to open...@googlegroups.com
As a consequence, Grid itself doesn't need to be extended by a ghost range; calling e.g. setBx(-1, -1, ..) would already call the corresponding function in the boundary. (So it seems Boundary has a similar interface to Grid?)

(if it is periodic boundary condition, boundary would know to what grid it is connected, and call setBx(15, 15,..) on that grid, which would cause the field to be set directly).
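
In other words, something like the following (purely illustrative, for a 16x16 grid and one field component; not the actual pixi code):

    // Illustrative only: out-of-range indices are handed to the boundary instead of
    // enlarging the field arrays by a ghost range.
    class BoundedGrid {
        static final int SIZE = 16;
        private final double[][] bx = new double[SIZE][SIZE];
        private PeriodicBoundary boundary;  // set after construction

        void setBoundary(PeriodicBoundary b) { this.boundary = b; }

        void setBx(int x, int y, double value) {
            if (x < 0 || x >= SIZE || y < 0 || y >= SIZE) {
                boundary.setBx(x, y, value);  // e.g. (-1, -1) ends up at (15, 15)
            } else {
                bx[x][y] = value;
            }
        }
    }

    class PeriodicBoundary {
        private final BoundedGrid target;  // for a periodic boundary this is the grid itself
        PeriodicBoundary(BoundedGrid target) { this.target = target; }

        void setBx(int x, int y, double value) {
            target.setBx(Math.floorMod(x, BoundedGrid.SIZE), Math.floorMod(y, BoundedGrid.SIZE), value);
        }
    }

So calling setBx(-1, -1, value) never touches a ghost array; the periodic boundary simply redirects the write to cell (15, 15).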

Maybe even Grid itself doesn't need to contain the fields, but there could be a large field (256x256), and one grid accesses a subset of that field (maybe 16 x 16).

Hm, thinking about this even further, Grid is just another kind of boundary - a boundary that is connected to several other boundaries at the border, and directly accesses the fields in its interior. Or maybe this concept is already too abstract? (or vice versa: Boundary is just a special kind of Grid)

Also, boundaries for particles and fields could be independent of each other, so that one could for example have a periodic boundary for fields and a hard wall boundary for particles at the same place (for example a thin transparent sheet where particles are reflected, but fields go through). In any case, it would be good if these two types of boundaries could be independent so that one could be improved without affecting the other.

Jan Kis

May 28, 2012, 12:34:39 PM
to open...@googlegroups.com
Hi Andreas, 

thanks a lot for the very fast feedback.

General Ideas

Thanks for sharing your thoughts. Indeed, grid decomposition and particle decomposition can be both useful approaches. Maybe even a mixture of the two could be used for the best performance (for example use GD across several CPUs, and PD within a single CPU).

That is exactly what I had in mind. All the papers use GD across several CPUs, and its only problem is load imbalance (which, as you mentioned, is not necessarily present, since it depends on the problem one is studying).

As mentioned, on a single CPU, GD could have significant advantage if particle collision is activated.

I agree. I do not promise anything, but currently it seems to me that trying GD as well as PD on a single CPU should not be much of an issue.

Another possibility I want to throw in is not only dynamical load balancing, but dynamical grid size. There could be one region with fewer particles that needs less resolution (e.g. 8 x 8 grid) while a neighboring region has many particles and could use higher resolution (e.g. 32 x 32 grid for same physical size). Connecting the two regions could be as "simple" as devising the right boundary conditions that would interpolate from the 8 rows from one side to 32 rows on the other side, and back. So the 8x8 region doesn't know what happens on the other side. All it sees is the boundary grid that behaves like just another 8x8 neighbor, while the boundary transparently interpolates to a 32x32 neighboring grid.

This is a very interesting thought. Again, I am not promising anything, but I believe that this kind of extension should not be too hard to add if we come up with a clever grid boundary design.

Requirements

* It would be good if the code structure would allow the physicist to implement the physics equations without needing to know whether GD or PD or a mixture of both is used. Also the implementation of more advanced algorithms should be easy.
* Dynamical load balancing would be great, but should again be fully transparent to the simulation. The physics simulation should not need to care about how it is parallelized. 

I would very much like to hide all parallelization details so that a physicist trying to implement a new field solver, interpolator or particle solver does not even have to think about the application running in parallel. It would be best if the physicist could think about the application as a single-threaded program.
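
For instance, a solver interface along these lines (the names are hypothetical, not the current code) would let the physics be written as if it were single-threaded:

    // Hypothetical sketch: the solver only sees a grid view; whether that view is the
    // whole simulation area or one subgrid on one node is decided elsewhere.
    interface FieldSolver {
        void step(GridView grid, double timeStep);
    }

    interface GridView {
        int numCellsX();
        int numCellsY();
        double getBz(int x, int y);            // out-of-range indices resolved by boundaries
        void setBz(int x, int y, double value);
    }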

Load Balancing
 
* If grids of different sizes are considered, the number of neighbors could vary even more: a large 32 x 32 grid of low activity could be connected to two 16 x 16 grids on the left side with high activity. The boundary condition would connect the upper half of the 32 ghost row with one 16x16 grid, and the lower half of the 32 ghost row with the other 16x16 grid.
* In each direction the boundary could be some physical boundary (hardwall, thermal, ..) or some "topological" boundary (periodic boundary, neighboring boundary, boundary to a different CPU, ...). Grids and boundaries could be moved for dynamical load balancing.

I was also thinking about these two techniques when thinking about load balancing. While they seem like great solutions, they have a couple of disadvantages:
1) Although they achieve load balance in the number of particles a node processes, they create an imbalance in the number of grid points a node has to handle in the field solving.
2) Both would raise many implementation issues, complicating the design quite a lot.
3) In the first approach, as the particles move we would need to dynamically reassign grids of different sizes to different nodes.
4) In the second approach we need to take into account the communication overhead of transferring a whole part of the grid from one node to another. Furthermore, one node could end up with disconnected regions of the grid, which would cause unnecessary communication.

Further Strategy

Thinking about all these different possibilities (all of which we won't implement right away, but just to keep in mind possible extensions), I'm thinking of the following strategy: ...

I think pretty much the same way. The design of a good boundary and grid should now be of the highest importance. I also think that we (more I than we :)) should start with the GD scenario across several CPUs, as it is the hardest part and would require the most design changes to the current code base.

As the very next step I suggest we agree on a simplified scope of what we want to implement, so that it stays feasible. Then I can start thinking about the design of the grid which will allow it to have flexible boundaries. Once we have the design, I can go ahead and implement it :)

To simplify things, I suggest using the following restriction on the simulation grid at the beginning: the grid can only have dimensions of 2^n (e.g. 2x8, 16x16, 64x128). This way we can easily create equal subgrids and do not have to come up with complicated algorithms for splitting the grid. Of course, in the end one node might compute several subgrids, depending on the number of nodes available.
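
With the 2^n restriction, splitting the grid becomes almost trivial. A small sketch (assuming the number of nodes is also a power of two; the class and method names are made up):

    // Sketch: split a 2^n x 2^m grid over numNodes = 2^k nodes by repeatedly halving
    // along the longer subgrid edge. Returns the resulting subgrid dimensions.
    class GridSplitter {
        static int[] subgridSize(int cellsX, int cellsY, int numNodes) {
            int partsX = 1;
            int partsY = 1;
            while (partsX * partsY < numNodes) {
                if (cellsX / partsX >= cellsY / partsY) {
                    partsX *= 2;  // halve along x
                } else {
                    partsY *= 2;  // halve along y
                }
            }
            return new int[] { cellsX / partsX, cellsY / partsY };
        }
    }

For example, a 64x128 grid on 8 nodes would end up as 32x32 subgrids, and on 2 nodes as 64x64 subgrids.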

Cheers,
Jan

Andreas Ipp

May 28, 2012, 1:30:59 PM
to open...@googlegroups.com
Hi Jan,

Further Strategy
The design of a good boundary and grid should now be of the highest importance. I also think that we (more I than we :)) should start with the GD scenario across several CPUs, as it is the hardest part and would require the most design changes to the current code base.

Very good. 

To simplify things, I suggest using the following restriction on the simulation grid at the beginning: the grid can only have dimensions of 2^n (e.g. 2x8, 16x16, 64x128). This way we can easily create equal subgrids and do not have to come up with complicated algorithms for splitting the grid. Of course, in the end one node might compute several subgrids, depending on the number of nodes available.

That's an acceptable restriction, although I don't really see its necessity at this stage. Ok, you probably want to avoid one node calculating 2x5 and the other node 3x5 so that one node doesn't have to wait for the other node all the time, but apart from the wasted idle time, it shouldn't be a conceptual problem to allow for arbitrary sizes?

Before we start to plunge into the GD problem, we should first fix all bugs mentioned in the other discussion thread / pull request, so that we have a good single threaded reference simulation. We could tag that and publish it as version 0.4. Jan, could you have another close look at the refactored code and see if you can help Kirill to get all tests running? Don't hesitate to ask here if there is something you don't understand. If Kirill is happy with the changes, we can push out version 0.4, and start working on 0.5 - code name "GD" :-)

Andreas

Jan Kis

May 28, 2012, 2:12:44 PM
to open...@googlegroups.com
That's an acceptable restriction, although I don't really see its necessity at this stage. Ok, you probably want to avoid one node calculating 2x5 and the other node 3x5 so that one node doesn't have to wait for the other node all the time, but apart from the wasted idle time, it shouldn't be a conceptual problem to allow for arbitrary sizes?

It is definitely not a conceptual problem :) I just believe it is a simplification which will allow us, when thinking about the design, to concentrate on more important things, like the boundary between two grids on different nodes.

Before we start to plunge into the GD problem, we should first fix all bugs mentioned in the other discussion thread / pull request, so that we have a good single threaded reference simulation. We could tag that and publish it as version 0.4. Jan, could you have another close look at the refactored code and see if you can help Kirill to get all tests running? Don't hesitate to ask here if there is something you don't understand. If Kirill is happy with the changes, we can push out version 0.4, and start working on 0.5 - code name "GD" :-)

Right you are. By our Skype conference on Wednesday I will try to get the tests running.

Kirill Streltsov

May 28, 2012, 3:56:13 PM
to open...@googlegroups.com
I think I mentioned earlier that the issues with the PoissonSolver are not related to the refactoring that Jan has done. The problem is writing a unit test that actually tests what one wants, or implementing the solver in a different way so that it gives results independent of the grid scaling... But those are physics problems.

And I think I have just found the (very stupid) error in the interpolator: I use g.setRho(...) instead of g.addRho(...), even though of course multiple particles can write to the same point, so their contributions have to be added up...
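
Roughly speaking, the difference is this (just an illustration, not the actual interpolator code, and it assumes a particle exposes x, y and charge):

    // Illustration of the bug: with setRho the last particle written to a cell wins,
    // with addRho the contributions of all particles in the cell accumulate.
    static void interpolateChargeToGrid(Iterable<Particle> particles, Grid g,
                                        double cellWidth, double cellHeight) {
        for (Particle p : particles) {
            int i = (int) (p.x / cellWidth);
            int j = (int) (p.y / cellHeight);
            // wrong:  g.setRho(i, j, p.charge);  // overwrites earlier particles in the same cell
            g.addRho(i, j, p.charge);             // correct: charge density is a sum over particles
        }
    }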

Will try to fix it and post again...

Andreas Ipp

May 29, 2012, 3:49:37 AM
to open...@googlegroups.com
Idea: I had another idea that could be considered prior to the design phase:

Would it be possible / useful / feasible to create a general parallelization framework for any kind of simulation in 2 or 3 dimensions that assumes some kind of locality (i.e. no long-range forces but only interactions in a small region involving at most the direct neighbors)?

This could be designed as a separate library on its own, which is versatile and doesn't need to know anything about the kind of physics that is going on. 

Pixi would then be one application that uses this general framework with a grid and with particles, but one could imagine other physics simulations based on the very same framework, like fluid simulations, molecular dynamics, Ising models, or ...

I now wonder whether such a framework already exists that could be used? Or, if it doesn't exist, is there a reason for its non-existence? (e.g. too abstract, too much overhead, each simulation requires slightly different tweaks, ...?)

Of course, there is MPI (and related tools) for comparatively "low level" communication, but I wonder whether such a higher-level framework already exists, or whether one could write one, that would hide all the complications of parallelization and let one concentrate on the actual physics? At least for the special case of 2D or 3D simulations with local interactions.

Let me know if you think I'm drifting away too much, and we should rather concentrate on a small well-defined problem first and maybe generalize later :-)

Jan Kis

May 29, 2012, 8:20:04 PM
to open...@googlegroups.com
Hi Andreas,

I have given it some thought and I think it would be really hard to implement such a framework so that it satisfies all of the following conditions:
A) is flexible enough so that it can also be used outside pixi on a variety of other physics simulations that exploit locality
B) achieves high performance by 
  - allowing the user to comfortably overlap communication and computation
  - efficiently exchanging data among different nodes

I believe that this would be a task for someone who has already successfully parallelized a couple of physics simulations :)

Let me know if you think I'm drifting away too much, and we should rather concentrate on a small well-defined problem first and maybe generalize later :-)

The better the problem is defined, the more likely we are to succeed :)

Cheers,
Jan

Andreas Ipp

May 30, 2012, 1:59:10 AM
to open...@googlegroups.com

The starting point of the idea was that we already have 2 kinds of "things" that we would need to parallelize: grids and particles.

They share some features:
- there is some region of interest, and neighboring regions.
- you want to abstract the region boundaries so that regions and neighbors behave like a virtual larger region which is what the physics classes get to see.
- regions need to be connected and data shared (either across memory or across CPUs)
- access to regions has to be synchronized, so that the simulation stays in sync and there is no race condition

So there may be some classes and methods that are common to both "things", grid and particles. If there are, one could generalize this common subset from 2 "things" to many "things".
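
If such a common subset exists, it might be as small as something like this (entirely speculative naming):

    // Speculative sketch: both the grid and the particle list could implement a common
    // interface, and the parallelization layer would only ever work with this interface.
    interface SharedRegion {
        /** Collect the data that neighboring regions need (ghost cells, border particles). */
        Object collectBorderData();
        /** Merge data received from a neighboring region into this region. */
        void applyNeighborData(Object data);
        /** Block until all exchanges for the current time step have finished. */
        void synchronize();
    }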

Maybe this thought can be useful when designing the class layout, but I agree it is not of importance now, and if there really is no common subset, then just drop this idea.
