Understanding when and how to use the parallel module

190 views
Skip to first unread message

Francesco Biscani

unread,
Oct 29, 2018, 8:38:06 AM10/29/18
to ProjectChrono
Dear all,

I am having troubles understanding how to make the parallel module work well for my specific simulation scenario (just to be clear, I am seeking to increase performance only through the use of multithreading at this stage, no CUDA, MPI, etc.).

In my setup, I have a bunch of convex hulls which are initially well separated and which eventually start to coalesce in agglomerates later in the simulation due to self gravity (I am investigating certain astrophysical phenomena related to planet formation).

In serial mode, everything works fine, however I am limited in the number of particles I can use. With about 1E4 particles, my timesteps at the beginning of the simulation take circa ~150ms, with 1E5 particles that figure climbs to ~1.5 seconds. The timesteps get slower as the simulation proceeds, as the number of collisions increases.

After a bit of fighting, I managed to set up a parallel simulation based on ChSystemParallelNSC (rather than ChSystemNSC). However, the runtime has actually increased rather than decrease: 1E4 particles now take in the order of ~600ms per timestep, with 1E5 particles the runtime seems to scale even worse (this is on a 4 cores / 8 threads machine).

For both the serial and parallel runs, I have touched the least amount of parameters possible to get the simulation running (that is, I am running everything with as much default parameters I can). The only substantial step I needed to take in order to get the parallel simulation running was to create a custom convex hull class which is essentially identical to the standard one (ChBodyEasyConvexHull), with the only difference that it uses a ChCollisionModelParallel rather than the standard (serial?) collision model.

At this point I am a bit at a loss on what to try next, and I have a few questions/comments:

- is the parallel module supposed to become more performant in later stages of the simulation, when the number of collisions increases? If so, is it a reasonable thing to do to run the initial part of the simulation in serial mode and then switch to a parallel simulation?
- what is the relation between functions such as ChSystem::SetParallelThreadNumber(), the type of the system (ChSystemParallelNSC vs ChSystemNSC), solvers which are seemingly multithreaded such as ChSolver::Type::SOR_MULTITHREAD, and so on. Or in other words, what is the recommended way to setup a thread-parallel simulation?
- is the lower performance in multithreaded mode related to the fact that (I believe?) ChCollisionModelParallel is not using the Bullet engine but possibly less efficient collision detection routines? Or is it due to other reasons?
- I noticed that even in parallel simulations my cpu usage  is most of the time fixed on a single core, so it seems like not much parallel processing is actually happening.
- should I switch to a distributed/MPI implementation instead or will it have the same behaviour as the thread-parallel one?
- could I recover better parallel performance by switching to different body shapes (e.g., spheres)?
- I noticed that some features in parallel mode are broken/not available. For instance, I am not ale to recover the AABBs of the bodies in parallel mode (zero AABBs are always returned) and removing bodies from a parallel simulation results in crashes due to out-of-bound accesses (I can elaborate more on this).

Apologies for the long email, and thanks for any help.

  Francesco.

Francesco Biscani

unread,
Oct 29, 2018, 8:48:38 AM10/29/18
to projec...@googlegroups.com
Just for reference, here's a video of the type of simulations I am running:


--
You received this message because you are subscribed to the Google Groups "ProjectChrono" group.
To unsubscribe from this group and stop receiving emails from it, send an email to projectchron...@googlegroups.com.
To post to this group, send email to projec...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Francesco Biscani

unread,
Oct 29, 2018, 12:23:33 PM10/29/18
to projec...@googlegroups.com
I ran some profiling on my simulations, and it turns out that the vast majority of time is spent in the ChCBroadphase::OneLevelBroadphase() function. Within this function, almost all of the time is spent in the two loops:

#pragma omp parallel for
    for (int index = 0; index < (signed)number_of_bins_active; index++) {
        f_Store_AABB_AABB_Intersection(index, inv_bin_size, bins_per_axis, aabb_min, aabb_max, bin_number_out,
                                       bin_aabb_number, bin_start_index, bin_num_contact, fam_data, obj_active,
                                       obj_collide, obj_data_id, contact_pairs);
    }

and

#pragma omp parallel for
    for (int i = 0; i < (signed)number_of_bins_active; i++) {
        f_Count_AABB_AABB_Intersection(i, inv_bin_size, bins_per_axis, aabb_min, aabb_max, bin_number_out,
                                       bin_aabb_number, bin_start_index, fam_data, obj_active, obj_collide, obj_data_id,
                                       bin_num_contact);
    }


E.g., within a typical timestep, ~690ms are spent in total in the OneLevelBroadphase() function, and of these ~360ms are spent in the first loop and ~320ms are spent in the second loop. Since, the number of OMP threads does not seem to make a difference in the runtime, the problem seems to be that the iterations in the loop are doing very uneven amount of works and thus most time is spent within a single thread.

So I dug into the f_Store_AABB_AABB_Intersection() and f_Count_AABB_AABB_Intersection() functions, replaced a couple of counters with std::atomic, added some

#pragma omp parallel for schedule(dynamic)

turned on nested parallelism with omp_set_nested() and now the runtime has reduced to ~250ms from the original ~690ms. I *think* the (now) parallel implementation of f_Count_AABB_AABB_Intersection is safe, but I am not very sure about the other function.

Is there any interest for me to chase this further?



Alessandro Tasora

unread,
Oct 29, 2018, 1:50:38 PM10/29/18
to Francesco Biscani, ProjectChrono

Hi Francesco
Let me comment some points here,  even if the "parallel" version of Chrono is not my contribution (I am mostly involved in the development of the "serial" version).

1
The fast SOR_MULTITHREAD solver,  which is available in the default "serial" Chrono, was experimental and it is non-deterministic in the sense that different processors can give different results (same issue for different runs),  therefore it should not be used for scientific work. Prefer using SSOR or SOR for tests, and BARZILAIBORWEIN for better precision at the cost of slower performance. Note that BARZILAIBORWEIN already uses some OpenMP multithreading,  although I admit that it could be optimized.

2
The serial chrono,  except for those few OpenMP optimizations,  uses mostly one core. This said, I am surprised that the parallel version is loading just one core in your tests...

3
The ChSystem::SetParallelThreadNumber() setting is a bit obsolete. It was used only by that SOR_MULTITHREADING code I mentioned before. Otherwise, the N of threads should be automatically set by OpenMP in the rest of the code.

4
Yes, at the moment not all features of Chrono serial are available in Chrono parallel. Sorry for this issue. In future we'll port more functionalities in the parallel part. Do you need some specific missing feature apart from that AABB stuff that you already mentioned?

By the way: nice work in your video...

Best regards

Alessandro Tasora

--
You received this message because you are subscribed to the Google Groups "ProjectChrono" group.
To unsubscribe from this group and stop receiving emails from it, send an email to projectchrono+unsubscribe@googlegroups.com.

Radu Serban

unread,
Oct 29, 2018, 3:47:25 PM10/29/18
to projec...@googlegroups.com

Hi Francesco,

One of the settings that you should *not* leave at default value is the number of bins used for the broad-phase collision detection.  If I recall correctly, the default is something like 20x20x20 or 10x10x10.  You will likely need quite a bit more than that, but will need to experiment to find the sweet spot.   As a rule of thumb, I suggest you try some values that lead to roughly 2-4 objects per bin per direction (in other words, bins with 8-64 objects each).

Indeed, Chrono::Parallel does not use the collision detection from Bullet.  Instead it uses a custom grid-based algorithm for broad-phase and a combination of Minkovski Portal Refinement + analytical for narrow-phase.   The broad-phase algorithm used in Chrono::Parallel (based on bins on a uniform grid which is assumed to be aligned with the global reference frame) is nowhere near as sophisticated as it should be and works best for objects of roughly the same size and for simulations where there is little "flow" of the collision shapes: situations like those encountered in terramechanics when modeling soil as granular material.  Which means that, for the type of simulations you are doing, a bin setting that works well at the beginning of the simulation may not be all that great when bodies coalesce.

You can in principle adjust the bin setting as the simulation progresses.  However, the current implementation assumes a uniform grid (i.e. bins of equal size in a given direction), so this will anyway be sub-optimal for your problems.   One solution is to allow for a non-uniform grid (and allow the user to control this, in a problem-dependent manner, as the simulation progresses) -- a modification along these lines is on my todo list, but I do not know when I'll get to it.

By the way, you can change the number of bins using something like:
       my_system.GetSettings()->collision.bins_per_axis = vec3(bins_x, bins_y, bins_z);

I'm very interested to see how much you can get out of using a more appropriate binning, at least at the beginning of your simulations, before significant aggregation occurs.  I also expect that you will see diminishing returns from parallelizing functions such as f_Store_AABB_AABB_Intersection() and f_Count_AABB_AABB_Intersection() since those will have little work left per active bin.

Having said that, it would be great if you could share with us the modifications you've experimented with.  In particular, which parallel for loops did you find benefited from dynamic scheduling?   A while ago (must be 2-3 years now) we did some extensive testing trying to find the best scheduling for the various parallel for loops in Chrono::Parallel.  What is implemented right now is the best compromise we converged on at that time, using certain benchmark problems that unfortunately may not be representative of what you are doing. 

You are also correct that Chrono::Parallel supports only a subset of what can be done with the serial Chrono solver (most notably, here is no support for FEA). 
Indeed, removing bodies from a Chrono::Parallel system is not implemented (this has to do with the underlying data structures used in the parallel code); there is some experimental code to allow that, but it had to be commented out due to some bugs.
Also, not all queries you can make for a serial Chrono system can be made for a Chrono::Parallel system, or at least not with the same API.  For your specific question, I think there should be a way to access the current AABBs.  If that is critical for you, I can look into this and give you a definite answer.

Bottom line is that Chrono::Parallel is overdue for a refactoring and revamping.  Feedback such as the one you provided and taking into account applications from domains other than those we typically work with are very useful.  Thank you for this and keep it coming :-)

Best,
Radu

Francesco Biscani

unread,
Oct 30, 2018, 11:48:54 AM10/30/18
to ser...@wisc.edu, tas...@ied.unipr.it, projec...@googlegroups.com
Hello Radu and Alessandro,

first of all, thanks a lot for the explanations!

Radu's suggestion about the binning was the critical insight to regain parallel speedup. What I think was happening was that I had a very low number of objects very far away from the centre of the simulation, and because of this I *think* that essentially all my bodies ended up being grouped in the same bin close to the centre (I believe that's the reason why I could obtain some speedup by parallelising the inner loops). I have now eliminated the far away bodies, increased the number of bins and I am observing a speedup with respect to the serial code. I am still toying around a bit testing different parameters, but this is a huge step forward for my simulations, thanks!

You can see the code I wrote yesterday here:


Basically I just added some extra parallelisation in the inner loops, and tried to make things thread-safe with the help of atomics. Normally I would do such things with TBB and std::atomic (as in my experience TBB works well in nested parallel situations), but I used an OpenMP-based approach in order to be consistent with the surrounding code. I am not sure this code is any more relevant now that the binning is fixed though, and I don't know how well OpenMP dynamic parallelism works in practice...

Currently I am using the AABB information in 2 ways: first as a rough estimate of the minimum body size, which I use in a heuristic to determine the integration timestep. Secondly, I need to track the formation of agglomerates during the simulation, and I do this by reading the AABBs of the bodies and doing some spatial hashing on top of that to determine the clusters of bodies connected by overlapping AABBs (which I suppose it somewhat similar to a broad phase collision detection?). The bodies with overlapping AABBs are my agglomerates, to some approximation. I currently cannot do this cluster detection in parallel runs, so it would be great to have a way to access the AABB information. At the moment all my particles have roughly the same size, so I suppose I could just fix some AABB valid to some approximation for all particles, but in the future I would like to vary the particle sizes, so having access to the actual AABB data would be necessary. Of course, if there was a way to do this sort of cluster detection directly from chrono that would be optimal, but I did not manage to understand if such a thing already exists.

Being able to remove bodies from a parallel simulation would also be really useful as I have some bodies which can be on escape velocities, and as they get farther away they complicate things both with respect to my tree code for computing the gravitational accelerations and, perhaps, also to the broad phase collision detection? This is not so critical at this time though, I believe.

I have a couple of extra somewhat-related questions:

- does chrono::parallel work in conjunction with FSI? we are thinking about adding a gas continuum in our particle collapse simulations, and it sounds like we could use FSI for that. As far as I have understood FSI runs on CUDA, but can then it be coupled to chrono::parallel for dealing with the gravitational/particle part of the simulation? does it make any sense?

- in the simulation video I posted above you can see that when large bodies form, their particles are kind of boiling and bouncing around all the time (see for instance the main central body around 6:16). For smaller bodies, the particles seem to be (more) at rest. My first intuition would have been that the particles would be quickly at rest after collapsing into larger agglomerates. I also remember reading in a chrono whitepaper that certain aspects of collisional phenomena are simulated as dampened harmonic oscillators (I believe?), so I was wondering if the behaviour shown there is physical or it is rather an artifact of the approximation used in the simulation.

Thanks again a lot and kind regards,

  Francesco.

Radu Serban

unread,
Oct 31, 2018, 3:42:51 AM10/31/18
to projec...@googlegroups.com

Francesco,

For now a couple more pointers to help with your current simulations.

I agree with your assessment of what was happening when using a small number of bins.  Indeed, the binning works by first finding a global AABB (which encompasses all AABBs of collision shapes in the system) and then dividing that large box in equal-size bins.  The small number of bins and the clusters in your simulation effectively led to a large number of sequential tests within the crowded bins. 

As you can probably see, this is still an issue if you have bodies which get farther and farther from your domain of interest.   There is one trick in Chrono::Parallel that you can use to address this issue, but you will need to ensure that does not affect the physics of the problem you're trying to solve.  Indeed, you can specify a large AABB that will define a boundary that your bodies cannot cross.  If you enable this option, as soon as a body reaches the boundary of this AABB, the body will be deactivated (it will not be removed from the system, but it will become a zombie).   This feature ensures that the computational domain stays bounded and a single body being "ejected" does not negatively affect performance of the collision detection.

You can enable this feature with code like this:
      my_system.GetSettings()->collision.use_aabb_active = true;
      my_system.GetSettings()->collision.aabb_min = bmin;
      my_system.GetSettings()->collision.aabb_max = bmax;
where (bmin, bmax) are triplets that define the "active area" AABB.

In your case, you probably need to specify such a box that is large enough and then ignore any zombie body from computation of gravitational forces (hopefully, this is an acceptable approximation).   You can test whether a body is "active" or a "zombie" by testing the corresponding entry in the vector:
      my_system.data_manager->host_data.active_rigid
using the body id as index.   If you have a shared pointer to a body, you can do a test like so:
      if (my_system.data_manager->host_data.active_rigid[body->GetId()] != 0)
         // active body
      else
         // zombie

Removing bodies from a Chrono::Parallel system is not straightforward.  This is because how data is organized in arrays with different levels of indirection.  Removing a body would mess with the indexing of bodies and shapes and would require recalculating all indices.

Finally, shape AABBs.  These are accessible in vectors in the data manager:
       my_system.data_manager->host_data.aabb_min
   my_system.data_manager->host_data.aabb_max
However, these arrays are indexed by a shape id.   So you can loop over all shapes in the system, access its current AABB, find the id of the body associated with that shape, and so on.  But not the other way around (at least not efficiently).   You can get a better picture of this if you look at the implementation of a function such as ChCAABBGenerator::GenerateAABB in ChAABBGenerator.cpp (line 110).

One more thing: the AABBs stored in the above two arrays are shifted such that they all lie in the 1st octant (i.e. they all have only positive values).  You can find the current value of the offset in
   my_system.data_manager->measures.collision.global_origin
and undo the offset if needed (i.e. apply the inverse of the shift implemented in ChBroadphase::OffsetAABB).

Having said that, I think it'd be possible to provide an implementation for ChCollisionModelParallel::GetAABB (currently absent).  I'll look into this. 

--Radu

Reply all
Reply to author
Forward
0 new messages