DEME-Engine single sphere drop performance


Yves ROBERT

Jan 30, 2024, 12:27:17 PM
to ProjectChrono
Hello,

I am working on a problem in DEME-Engine which involves dropping one sphere at a time into a geometry from its top. The geometry can end up with several hundred thousand spheres poured into it, so I need something efficient. The constraint is that every sphere must be dropped from the same spot with zero velocity.

The problem I have is that it is very slow.

I made the attached example, where I fast-forward to 50,000 spheres in the geometry and then drop spheres one by one. When measuring the performance (see the attached log), I get something like 6.2 seconds per drop. The overhead I measured when starting from 0 spheres was ~0.2 s, which gives 6/50,000 = 120e-6 s per sphere per drop. Even if I adjust the step size perfectly for each drop, that means filling the geometry with, say, 500,000 spheres would take around 6 months of computation.

Therefore, I am writing to ask whether:
  1. Something is wrong with my script.
  2. Some values can be safely relaxed. The Young's modulus and other sphere parameters were taken from a paper, so I would prefer not to touch them. The time step already seems fairly large in my example.
  3. There are techniques that could be applied to lower the computational cost of this kind of common problem.
Thank you!
test_insertion.py
log_insertion.txt

Ruochun Zhang

Feb 1, 2024, 3:35:11 AM
to ProjectChrono
Hi Yves,

I only had a brief look at the script. So what you need is to add more spherical particles to the simulation, one by one, and I assume you need to do this thousands of times.

The problem is that adding clumps, i.e. UpdateClumps(), is not designed to be called very frequently; it is really meant for adding a big batch of clumps at once. When you call it, the threads need to sync (roughly the cost of one round of contact detection), and then the system goes through a process similar to initialization (no just-in-time compilation, but still a lot of memory accesses). That said, I would expect it to be faster than what you measured (6.2 s); maybe you also included the time needed to advance a frame in between (I didn't look into that much detail).

In any case, it's much better to avoid adding clumps at run time. If you know how many you will eventually have to add, initialize the system with all of them already in, but frozen (in a family that is fixed and has contacts disabled with all other families). Track these clumps with a tracker (or several trackers, if you prefer). Then, each time you need to add a clump, use the tracker to switch one clump of this family (using the offset argument, starting from offset 0 and moving on to 1, 2, ... each time) to a different family so it becomes an "active" simulation object. You can also SetPos this clump before activating it. This should be much more efficient, as a known-size simulation should be. As for material properties, I don't think they have a significant effect here.
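
Roughly, in Python it could look something like this (the names and family numbers are just placeholders, and this is a sketch rather than a tested script):

IDLE_FAM, ACTIVE_FAM = 2, 1
DEMSim.SetFamilyFixed(IDLE_FAM)
DEMSim.DisableContactBetweenFamilies(IDLE_FAM, IDLE_FAM)
DEMSim.DisableContactBetweenFamilies(IDLE_FAM, ACTIVE_FAM)

# Load every clump you will ever need up front, all frozen
batch = DEMSim.AddClumps(template, initial_positions)
batch.SetFamilies([IDLE_FAM] * n_total)
tracker = DEMSim.Track(batch)
DEMSim.Initialize()

# "Adding" clump number i later is just re-pointing and re-assigning it
tracker.SetPos(pos=drop_point, offset=i)
tracker.SetVel(vel=[0, 0, 0], offset=i)
tracker.SetFamily(fam_num=ACTIVE_FAM, offset=i)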

Let me know if there is any difficulty implementing it,
Ruochun

Yves ROBERT

Feb 2, 2024, 10:31:02 AM
to ProjectChrono
Hello Ruochun,

Thank you for your answer.

That makes a lot of sense, especially since, in my case, I know how many I need from the beginning.
Your proposed method is quite smart; I will try to implement it.  I will run some tests and come back here to report the difference.

Something else I was wondering: is there any way to somehow "relax" the problem in some parts of the geometry? The bottom of the geometry will not see large velocities or strong changes once a few spheres have covered it, and the same will apply to the layers above later in the simulation.
If that is possible, I expect it to be a large time saver as well.

Thank you,
Yves

Ruochun Zhang

Feb 3, 2024, 7:21:11 AM
to ProjectChrono
Hi Yves,

In terms of further simplifying the computation, my understanding is that if the scale of your simulation is around 50,000 or 100,000 particles, then saving time by partially "relaxing" the simulation domain is probably not necessary. The number of bodies is low anyway, and reducing the "effective" number of active simulation bodies even further would only blur the performance edge of a GPU-based tool. However, letting the simulation cover a longer simulated time using fewer time steps should always help.

I feel the best approach is to dynamically select the time step size. If you know that during certain periods of the simulation everything is relatively "dormant", then you can use a larger time step during those periods via the method UpdateStepSize. You can change it back with the same method whenever you believe a collision that requires fine time steps to resolve is about to happen.
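
For instance, something along these lines (the numbers are arbitrary, just to illustrate the idea):

# Fine steps while a freshly dropped sphere is still moving fast
DEMSim.UpdateStepSize(2e-6)
DEMSim.DoDynamics(0.05)

# Relax the step size once the bed is "dormant" again
DEMSim.UpdateStepSize(1e-5)
DEMSim.DoDynamics(0.5)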

If you still wish to "relax" a subset of the clumps in the simulation, then family-based magic is the way to go. If you believe some clumps are effectively fixed in place during a period, then you can, again, freeze them using the approach I discussed above. This does save time, because those clumps will simply not have contacts among themselves. You could also change the material associated with a subset of the clumps using the method SetFamilyClumpMaterial. However, I have to mention that different material properties hardly make any impact on computational efficiency: a softer material with more damping could allow a more lenient time step, but the step size is still determined by the "harshest" contact that you have to resolve.
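
For example (SETTLED_FAM and mat_soft are placeholders, and please double-check the exact argument list of SetFamilyClumpMaterial against the API):

# Give the clumps buried deep in the bed a softer, more damped material
DEMSim.SetFamilyClumpMaterial(SETTLED_FAM, mat_soft)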

The ultimate tool is of course the custom force model. If you can design a model that is fast to evaluate, accurate enough for your purposes, and potentially resolves different parts of the simulation domain differently as you wished, that is probably the best option. As a starting point, if you do not need friction, try calling UseFrictionlessHertzianModel() before system initialization to use the frictionless Hertzian contact model. You can develop even cheaper, more specialized models after that.
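
That one is a single call before initialization:

DEMSim.UseFrictionlessHertzianModel()
# ... rest of the setup ...
DEMSim.Initialize()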

Thank you,
Ruochun

Yves ROBERT

Mar 19, 2024, 10:39:42 AM
to ProjectChrono
Hi Ruochun,

Based on that discussion, I developed a routine that creates the target number of spheres (say, 500,000), puts them into an "idle" family which should not interact with the environment, and then converts them one by one to fill the core. 
Here is a simple snippet as an example:

import numpy as np

# (DEMSim, material, can_insert() and dt are defined elsewhere in the script)

# Dummy input
World = {'half_width': 5, 'height': 20, 'family': 0}
sphere = {'radius': 0.05, 'mass': 1.0, 'family': 1, 'idle_family': 2, 'N_target': 500000}
sphere['template'] = DEMSim.LoadSphereType(sphere['mass'], sphere['radius'], material)

# Set an idle family which does not interact with the environment
DEMSim.DisableFamilyOutput(sphere['idle_family'])
DEMSim.SetFamilyFixed(sphere['idle_family'])
DEMSim.DisableContactBetweenFamilies(sphere['idle_family'], sphere['idle_family'])
DEMSim.DisableContactBetweenFamilies(sphere['idle_family'], sphere['family'])
DEMSim.DisableContactBetweenFamilies(sphere['idle_family'], World['family'])

# Add all spheres before initializing, at dispersed dummy positions
dummy_pos = [[np.random.uniform(-World['half_width'] + sphere['radius'],
                                World['half_width'] - sphere['radius']),
              np.random.uniform(-World['half_width'] + sphere['radius'],
                                World['half_width'] - sphere['radius']),
              0] for _ in range(sphere['N_target'])]
dummy_vel = [[0., 0., 0.]] * sphere['N_target']
sphere['object'] = DEMSim.AddClumps(sphere['template'], dummy_pos)
sphere['object'].SetVel(dummy_vel)
sphere['object'].SetFamilies([sphere['idle_family']] * sphere['N_target'])
sphere['tracker'] = DEMSim.Track(sphere['object'])

# Initialize
sphere['N_inserted'] = 0
DEMSim.Initialize()

# Run and insert spheres at the top of the geometry when possible
while sphere['N_inserted'] < sphere['N_target']:
    if can_insert():
        # Move the next idle sphere to the drop point and activate it
        sphere['tracker'].SetPos(pos=[0, 0, World['height'] - sphere['radius']], offset=sphere['N_inserted'])
        sphere['tracker'].SetVel(vel=[0, 0, 0], offset=sphere['N_inserted'])
        sphere['tracker'].SetFamily(fam_num=sphere['family'], offset=sphere['N_inserted'])
        sphere['N_inserted'] += 1
    DEMSim.DoDynamics(dt)
As you can see, I have to put the spheres at dispersed dummy positions, otherwise I get an error about having too many spheres in the same bin. Is that the right way to do it?
Time-wise, each step is faster than before (thanks to removing UpdateClumps), but still quite slow, and the cost scales linearly with the target number of spheres rather than with the number of currently inserted spheres.
Am I doing this wrong?

Thank you

Ruochun Zhang

Mar 22, 2024, 5:29:35 AM
to ProjectChrono
Hi Yves,

The fact that you have to create the inactive particles at dummy locations, and that the scaling is not ideal, could both be remedied once a mechanism that fully freezes a family (not just disabling contacts and fixing positions) is implemented. Like I said last time, I haven't had the time to do that yet. But with what we have now, you are doing this correctly.

To understand how the performance can be improved, we have to understand the time spent on each sub-task. There are a few possibilities that I can think of just by looking at the script:
1. If dt is extremely small (close to the step size itself), then perhaps most of the runtime is spent on changing families. There is very little we can do if the physics of the problem calls for that.
2. But I guess dt is in fact some meaningful amount of time (say 0.05 s). In that case, the first concern is that adding spheres one by one makes for an extremely long simulation, thousands of seconds of simulated time. What we can do to improve such a long simulation is limited, since there is a limit to how much we can optimize the step size. You should first think about whether you need to do this at all, or at least make sure you only do it once and then save the simulation state for future use. In this case, the majority of the runtime should be in DoDynamics(dt), and we need to know the load balance (CollaborationStats and TimingStats; see the snippet after this list). It might be that contact detection is heavy while the number of contacts is low, so dT is essentially always waiting for kT. The first 50k or so particles do not scale anyway (because a GPU is used); past that point, the scaling is probably dominated by the load balance.
3. I assume that by scaling you mean, say, that with 100k particles currently in the simulation, having 400k yet-to-be-inserted particles makes it run roughly twice as slow as having 150k yet-to-be-inserted particles. Let's make sure we are talking about the same thing, because obviously the entire run scales with N_target anyway, since you loop that many times.
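
To get those numbers, printing the statistics after a representative stretch of the run should be enough, something like this (I am writing the method names from memory, so check them against the API):

DEMSim.DoDynamics(1.0)                 # a representative stretch of the simulation
DEMSim.ShowThreadCollaborationStats()  # how often dT has to wait on kT
DEMSim.ShowTimingStats()               # where the wall time actually goes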

Thank you,
Ruochun

Yves ROBERT

Mar 22, 2024, 10:12:27 AM
to ProjectChrono
Ruochun,

Thank you for your answer.

dt is quite large compared to the step size, so your estimate is correct.
I will run a simple simulation to get you the CollaborationStats and TimingStats.
I can always simplify the case, but it would lead to a much less realistic distribution if I, for example, dropped a whole column instead of a single sphere, since the spheres would hit the packed bed at the wrong speed.
I could also drop a disk of spheres, but that would drastically change the distribution, especially at the top, where a cone should form.

You got my poor definition of scaling right; I apologize for the confusion.
Since I want to recirculate the spheres once dropped, and do that many times ("deleting" them and creating new ones to mimic that behavior), if I have 500,000 spheres that I want to recirculate 10 times, dropping them one by one, the performance hit from introducing 5,000,000 spheres instead of 500,000 is huge.
Restarting the calculation every time would be quite a headache as well, and would still mean simulating 1,000,000 spheres.

Thank you,
Yves

Ruochun Zhang

Mar 22, 2024, 4:09:29 PM
to ProjectChrono
Let's see what the timing stats say. And you decide the physics: if inserting the spheres one by one is what you need, then even if it is a 1000 s simulation, you probably have to brute-force through it.

You know, in your case I think the best approach is a hybrid one: you call UpdateClumps() each time with a small batch, say 1,000 spheres. All of those 1,000 spheres start in a dormant family and are then activated one by one. Once they are all used up, you add another batch of 1,000. This should scale better (theoretically with the number of currently active clumps), and since each batch is sizeable, you do not have that many UpdateClumps() calls, so it is not too expensive on that front either. Of course, if the load balance is indeed very bad, this probably does not resolve everything.
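
A rough sketch of what I mean (the batch size and names are placeholders, and I am assuming a tracker can be attached to a batch that is added through UpdateClumps() after initialization; I have not tested this exact code):

BATCH = 1000
inserted_total = 0
while inserted_total < N_target:
    # Add one dormant batch after initialization and push it into the system
    batch = DEMSim.AddClumps(template, dispersed_dummy_positions(BATCH))
    batch.SetFamilies([IDLE_FAM] * BATCH)
    tracker = DEMSim.Track(batch)
    DEMSim.UpdateClumps()

    # Activate this batch one sphere at a time, exactly as you do now
    for i in range(BATCH):
        # (wait until the drop point is clear before re-pointing sphere i)
        tracker.SetPos(pos=drop_point, offset=i)
        tracker.SetVel(vel=[0, 0, 0], offset=i)
        tracker.SetFamily(fam_num=ACTIVE_FAM, offset=i)
        inserted_total += 1
        DEMSim.DoDynamics(dt)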

Thank you,
Ruochun 
