A better load balancing algorithm

Riccardo Caraccio

Jun 20, 2026, 6:41:49 AM

to basilisk-fr

Dear Basilisk users,

Over the past month, I've been working with Prof. Popinet to develop a more efficient method for assigning cells to each processor in a parallel MPI simulation on a quad/octree grid. Currently, Basilisk tries to assign the same number of cells to each processor, without considering that the computational load may be concentrated in particular zones of the domain. For example, this can happen when a lot of computation takes place at the interface of a VOF field or at the boundaries of the domain. This is the problem I set out to solve.

I've attached the preliminary patch to this conversation; it will hopefully be merged into src later. If you'd like, you can try it out and send me any feedback you may have. Below you can find a description of the changes.

The main idea is to evenly redistribute a total "load" rather than the number of cells. This information is held in the (const) scalar field balance_weights, which by default is unity. If you do not specify anything, the balancing algorithm automatically defaults to the current behaviour, i.e., each processor gets the same number of cells. To activate the modified version, you have two options:

1. You specify the balance_weights field yourself by allocating the scalar field, something like:
event init (i = 0) {
  balance_weights = new scalar;
}
Then you can assign any value that is proportional to what you think your computational effort is. It can come from your intuition or from something you measure yourself, for example:
foreach()
  balance_weights[] = f[]*20. + 1.;
The only constraint is that balance_weights[] >= 0. everywhere, with at least one non-zero cell. Values need not be integers, and there is no need to normalize the field. If the total weight is 0, the balancer does nothing and leaves the current distribution unchanged.

2. You enable automatic detection of the weights through the compilation flag CFLAGS+=-DLB_AUTO=1. This makes Basilisk try to minimize the communication/synchronization time spent exchanging information between processors. Note that this approach is based on the timing of Basilisk's MPI wrappers (like mpi_all_reduce). Any MPI communication you do outside these wrappers is not counted, so heavy use of "raw" MPI calls in your own code may make the estimate less accurate.
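
To make the "balance the load, not the cell count" idea concrete, here is a minimal sketch in plain C (not Basilisk C, and not the code in the patch). It assumes the cells are already listed in some traversal order and gives each rank a contiguous run of cells of roughly equal total weight. The function name partition_by_weight and its arguments are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: assign each of the n cells (given in traversal order)
   to one of nproc ranks so that every rank gets a contiguous run of cells
   carrying roughly the same total weight.
   Returns 0 and leaves rank[] untouched if the total weight is zero,
   1 otherwise. */
int partition_by_weight (const double *w, size_t n, int nproc, int *rank)
{
  assert (nproc > 0);
  double total = 0.;
  for (size_t i = 0; i < n; i++) {
    assert (w[i] >= 0.);        /* weights must be non-negative */
    total += w[i];
  }
  if (total <= 0.)
    return 0;                   /* nothing to balance */

  double cum = 0.;              /* weight of all cells before cell i */
  for (size_t i = 0; i < n; i++) {
    /* the rank is decided by where the "middle" of the cell's weight
       falls in the cumulative weight distribution */
    int r = (int) ((cum + 0.5*w[i])*nproc/total);
    rank[i] = r < nproc ? r : nproc - 1;
    cum += w[i];
  }
  return 1;
}
```

For six cells with weights 1, 1, 1, 1, 21, 21 (the f[]*20. + 1. formula with f = 0 in four cells and f = 1 in two) on two ranks, this sketch gives the first five cells to rank 0 (load 25) and the last cell to rank 1 (load 21), whereas an equal-count split of three cells each would give loads of 3 and 43. With unit weights it reduces to the equal-count split.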

The patch also includes two new test cases (based on src/test/rotate.c) that serve as examples: balance-rotate.c, using option 1, and balance-rotate-auto.c, using option 2.

Note that activating this should have NO IMPACT on the results of your simulations, only a reduction (hopefully) of your computational time.

Again, I appreciate any feedback (or wishes) you may have.

Thanks,

Riccardo Caraccio

load_balancing.patch

Stephane Popinet

Jun 20, 2026, 6:46:55 AM

to basil...@googlegroups.com

Hi Riccardo,

> Note that activating this should have NO IMPACT on the results of your
> simulations, only a reduction (hopefully) of your computational time.

This is not strictly true because the details of the parallel
partitioning can affect the convergence (and thus the results) of the
multigrid linear solvers. So in these cases you can expect some changes,
but only below the TOLERANCE of the linear system solutions.
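
As a generic illustration of how a different partition can change a floating-point result at all (this is plain C, not Basilisk's multigrid solver, and whether this is the dominant mechanism there is an assumption): floating-point addition is not associative, so a sum assembled from per-rank partial sums can depend on where the data are split. The function name two_rank_sum is hypothetical.

```c
#include <assert.h>

/* Illustrative only: sum v[0..n-1] the way two ranks would.
   Rank 0 sums v[0..split-1], rank 1 sums v[split..n-1], and the two
   partial sums are then added, as a reduction would do. */
double two_rank_sum (const double *v, int n, int split)
{
  assert (split >= 0 && split <= n);
  double part0 = 0., part1 = 0.;
  for (int i = 0; i < split; i++)
    part0 += v[i];
  for (int i = split; i < n; i++)
    part1 += v[i];
  return part0 + part1;
}
```

With standard IEEE double arithmetic, calling this on the values 1e16, -1e16, 1. with split = 1 and with split = 2 returns different sums, although the data are identical. In a converging iterative solver such differences stay small, but they are not exactly zero.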

cheers,

Stephane

j.a.v...@gmail.com

Jun 24, 2026, 8:11:40 AM

to basilisk-fr

Hello Riccardo,

Very interesting patch, thank you for sharing. It is very user-friendly. However, I encountered some unexpected behaviour when attempting to design a new weighting strategy.

To illustrate, I tested with:

foreach()
  balance_weights[] = dv(); // (see below)

which should, in theory, result in a static equi-volume partitioning of the domain. But the decomposition is not 100% static when changing the mesh (and refreshing the weights).

Is this a bug? Or do you have any ideas to get the behaviour I tested for?

Antoon.

To reproduce, run with 3 processes:

#include "utils.h"

int main() {
  init_grid (N);
  L0 = 3;
  X0 = Y0 = -L0/2.;
  scalar pids[];
  balance_weights = new scalar;
  for (double t = 0; t < 2*pi; t += 0.1) {
    unrefine (level > 5);
    refine (sq(x - sin(t)) + sq(y - cos(t)) < 1 && level < 10);
    foreach()
      balance_weights[] = dv();
    while (balance (balance_weights));
    foreach()
      pids[] = pid();
    output_ppm (pids, file = "pid.mp4", n = 400, min = 0, max = npe() - 1);
  }
}
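
The expectation behind this test (volume weights should give partition boundaries that do not move when the mesh is refined) can be checked on a 1D toy model. The following is a plain C sketch, not Basilisk C; the function boundary_position and the cumulative-weight rule it uses are assumptions made for illustration, not the algorithm in the patch.

```c
#include <assert.h>

/* Illustrative only: cells are laid out left to right starting at x = 0,
   cell i has physical size width[i] and balancing weight weight[i].
   Each cell goes to the rank in which the middle of its weight falls in
   the cumulative weight distribution. Returns the x coordinate where the
   chunk of rank k ends. */
double boundary_position (const double *width, const double *weight,
                          int n, int nproc, int k)
{
  assert (n > 0 && nproc > 0 && k >= 0 && k < nproc);
  double total = 0.;
  for (int i = 0; i < n; i++)
    total += weight[i];
  assert (total > 0.);

  double x = 0., cum = 0.;
  for (int i = 0; i < n; i++) {
    int rank = (int) ((cum + 0.5*weight[i])*nproc/total);
    if (rank > k)
      return x;                 /* first cell beyond rank k starts here */
    x += width[i];
    cum += weight[i];
  }
  return x;                     /* rank k owns the last cell */
}
```

On a domain of length 3 split over 3 ranks, with the cells between x = 0.5 and x = 1.5 refined from size 0.5 to size 0.25, passing the cell sizes as weights keeps the boundaries at x = 1 and x = 2 on both the coarse and the refined grid. Passing unit weights (equal cell counts) on the refined grid moves the second boundary to x = 1.5.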

On Saturday, June 20, 2026 at 12:46:55 UTC+2, Stephane Popinet wrote:

Riccardo Caraccio

Jul 1, 2026, 8:19:07 AM

to basilisk-fr

Hi Antoon,

Thanks a lot for the feedback, I really appreciate it. Sorry for the late reply, but it took me a while to pinpoint the issue. I can confirm it's a bug.

It appears to be caused by the fact that, when a cell's ownership flips to another rank, balance() does not recompute that cell's weight. If the cell already existed on the receiving rank as a ghost cell, its pid is simply relabeled, and the value already stored there is then treated as "the true value", without copying the original one from the previous owner. Because the stencil values of the weights are never accessed between balance() iterations, the halos are never reconciled unless boundary() is called explicitly. This is an easy fix, which I've added to the patch at line 250 of grid/balance.h:

if (!is_constant (w))
  boundary ({w});
scalar newpid[];
double tl = z_weights (newpid, w, mpi.leaves);

You can also test it in your case with:

do {
  boundary ({balance_weights});
} while (balance (balance_weights));

This fix seems to achieve the true (and correct) static partition that you were looking for.
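
The mechanism described above can be mimicked with a toy model in plain C (not Basilisk C). It only reproduces the logic of "relabel a ghost copy without refreshing it first"; the CellCopy type and the flip_ownership function are hypothetical and do not correspond to anything in grid/balance.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: one rank's local copy of a cell, i.e. the weight it
   has stored and the rank it believes owns the cell. */
typedef struct {
  double w;
  int owner;
} CellCopy;

/* Hand the cell over to rank dst, which already holds a ghost copy.
   If sync_first is true, the ghost copy is refreshed from the current
   owner before the relabel (the role played by the boundary() call in
   the fix above). Returns the weight the new owner ends up using. */
double flip_ownership (CellCopy *on_src, CellCopy *on_dst, int dst,
                       bool sync_first)
{
  assert (on_src->owner != dst);
  if (sync_first)
    on_dst->w = on_src->w;      /* reconcile the halo value */
  on_src->owner = on_dst->owner = dst;
  return on_dst->w;
}
```

If the owner stores a weight of 4 and the ghost copy on the receiving rank still holds an outdated 1, the flip without synchronization leaves the new owner working with 1, while the flip with synchronization gives 4.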

Let me note finally that I've tried to tackle this to the best of my knowledge, as I am only recently getting into the low-level architecture stuff, and I may be wrong about the true reason. I guess this is a further point I may need to discuss with Stéphane.

Again, thank you very much for this very interesting test. I’d love to chat more about this if you have further questions.

Best,

Riccardo

Arun K Eswara

Jul 1, 2026, 10:32:22 AM

to Riccardo Caraccio, basilisk-fr

Dear Riccardo,

I have noticed that adapt_wavelet alone accounts for 80-85% of the simulation's per-step wall-clock cost. If you are working on an efficient implementation, it would be worth checking how this bottleneck can be reduced.

Regards,

Arun

--
You received this message because you are subscribed to the Google Groups "basilisk-fr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to basilisk-fr...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/basilisk-fr/56a13ed4-deb6-4e0b-a28a-a2ff374fb42bn%40googlegroups.com.

Riccardo Caraccio

Jul 3, 2026, 12:16:29 PM

to basilisk-fr

Dear Arun,

Thank you for the note. I think there may be a misunderstanding about the scope of this patch, so let me clarify.

If you are referring to the speed of the adaptation algorithm itself (i.e., how fast the mesh is modified), then that is not the problem this patch addresses. My goal is a better distribution of cells across processors, so that the other operations in the step are parallelized at maximum efficiency.

The 80–85% figure you report suggests your simulation is dominated by adaptation because it does comparatively little other work per step (for example, if you are solving essentially only the Navier–Stokes equations). In that regime, the adaptation cost is largely an unavoidable, fixed overhead, and this patch is not designed to reduce it. It pays off when there is substantial per-cell work whose imbalance the default cell split fails to correct: there, redistributing the load yields a significant net speedup even though the adaptation cost is unchanged.
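
A rough, Amdahl's-law-style estimate makes this quantitative. The sketch below is plain C and rests on a simplifying assumption: a fraction f_adapt of the step time (the adaptation) is unaffected by the balancing, while the rest of the step becomes s times faster. The function name step_speedup is hypothetical.

```c
#include <assert.h>

/* Illustrative only: overall per-step speedup when a fraction f_adapt of
   the step time is unchanged and the remaining fraction becomes s times
   faster. */
double step_speedup (double f_adapt, double s)
{
  assert (f_adapt >= 0. && f_adapt <= 1. && s > 0.);
  return 1./(f_adapt + (1. - f_adapt)/s);
}
```

With f_adapt = 0.8, doubling the speed of everything else (s = 2) gives an overall speedup of only about 1.11, and no value of s can exceed 1/0.8 = 1.25. With f_adapt = 0.2, the same s = 2 gives about 1.67.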

If, on the other hand, you are observing that applying this patch makes the adaptation time itself skyrocket, then that would be unexpected, and I would like to look into it. Please send me a minimal reproducible example. On my side, I made sure to keep the load-balancing step itself parallelized for maximum efficiency, so it can safely be run every iteration, exactly as you would with the current adaptation algorithm.

Best regards,
Riccardo

Arun K Eswara

Jul 7, 2026, 4:53:07 AM

to Riccardo Caraccio, basilisk-fr

Dear Riccardo,

Yes, I was trying to speed up my simulation on a GPU. Maybe I misunderstood your patch. However, as suggested, I am attaching my script and a recent presentation that I gave at BMM on 30 June 2026.

The script, with other diagnostics, will take about 20 days to run on an RTX 5090 GPU / AMD Ryzen Threadripper PRO 7965WXs × 48 / 129 GiB machine for 60 s of physical simulation time.

The hindrance I am noticing stems from the physics (cryogenic properties), and the existing 3D octree grid does not seem to run on the GPU. Either adapt_wavelet or (more likely) the grid takes up that time when advancing the simulation. If you find any improvement, let me know.

Best wishes,

Arun
