Scaling WRF Post-Processing in Python with ProcessPoolExecutor


Will H

Nov 26, 2025, 11:08:57 AM
to wrfpython-talk




Numerical weather prediction models like WRF can produce terabytes of
output. Turning that raw data into actionable fields and graphics often
becomes a bottleneck: parcel diagnostics, CAPE/CIN, lifted indices,
radiation budgets, and other derived quantities can be computationally
expensive.

Many of these calculations are “embarrassingly parallel”: each WRF file,
each time step, or each grid column can be processed independently. Yet a
lot of Python post-processing still runs serially on a single core.

This is where ProcessPoolExecutor from concurrent.futures becomes a
powerful tool. Used correctly, it lets you exploit multiple CPU cores for
heavy WRF diagnostics and post-processing workflows.
------------------------------
The core idea: bypassing the GIL for CPU-bound work

Python’s Global Interpreter Lock (GIL) prevents true parallel execution of
Python bytecode within a single process. Threads are great for I/O-bound
tasks, but CPU-intensive numerical work doesn’t scale well with threading.

ProcessPoolExecutor solves this by:

- Spawning multiple separate processes, each with its own Python interpreter and its own GIL.

- Allowing CPU-bound functions (numerics, MetPy, SciPy, WRF diagnostics, fsolve, etc.) to run truly in parallel.

- Providing a simple high-level API (map, submit) for dispatching work units across cores (a short sketch follows this list).
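
To see the two dispatch styles side by side, here is a minimal, self-contained sketch; slow_diagnostic is a stand-in for a real WRF routine:

from concurrent.futures import ProcessPoolExecutor, as_completed

def slow_diagnostic(x):
    # Stand-in for a CPU-heavy WRF calculation
    return x * x

if __name__ == "__main__":
    inputs = range(8)

    with ProcessPoolExecutor(max_workers=4) as executor:
        # map: results come back in input order
        ordered = list(executor.map(slow_diagnostic, inputs))

        # submit + as_completed: results arrive as each task finishes
        futures = [executor.submit(slow_diagnostic, x) for x in inputs]
        for fut in as_completed(futures):
            print(fut.result())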

If your WRF workflow looks like:

- Loop over dozens of wrfout files,

- Or loop over many time steps in one file,

- Or loop over grid points performing identical, heavy calculations,

then ProcessPoolExecutor is often an immediate win.

A common, practical strategy in WRF post-processing is to select a small
fixed number of workers, for example:

from concurrent.futures import ProcessPoolExecutor

MAX_WORKERS = 4  # hard limit: at most 4 processes

This makes CPU usage predictable and prevents oversubscribing the machine.
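
One common variant (a sketch, not from the original post) is to derive the cap from the machine's core count:

import os

# Never exceed the machine's cores; os.cpu_count() can return None
MAX_WORKERS = min(4, os.cpu_count() or 1)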
------------------------------
Where ProcessPoolExecutor shines in WRF workflows

Typical WRF post-processing includes operations such as:

- Derived thermodynamic / parcel profiles (CAPE, CIN, lifted index),

- Moist adiabatic lifting and non-linear solves,

- Wet-bulb globe temperature (WBGT) or radiation balances,

- Generating many static PNGs for different times or domains.

These are ideal for process-based parallelism because:

- Each file, time, or grid point is independent.

- Work is CPU-heavy and dominated by numerical routines, not I/O.

- There is minimal need for shared mutable state.

Broadly, there are three patterns that work well:

1. Parallel over WRF output files

2. Parallel over time steps within a file

3. Parallel over rows/columns or chunks of the grid

The first two are generally the most robust and memory-friendly.
------------------------------
Pattern 1: Parallel over WRF output files

This is often the cleanest approach: treat each wrfout file as a separate
job.
Step 1: Define a worker function

The worker should:

- Open the NetCDF file,

- Compute whatever diagnostic you need (e.g., a 2D field),

- Create and save a plot or return summary values.

Example:

from concurrent.futures import ProcessPoolExecutor
import glob

from netCDF4 import Dataset
import numpy as np
import wrf
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safer in worker processes
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

def process_wrfout_file(path):
    # 1. Open the file
    with Dataset(path) as nc:
        # Example: temperature interpolated to 500 hPa (or any derived field);
        # interpolating makes the plotted field 2D
        temp = wrf.getvar(nc, "temp", units="K")
        pres = wrf.getvar(nc, "pressure")
        t_500 = wrf.interplevel(temp, pres, 500.0)
        lats, lons = wrf.latlon_coords(t_500)
        proj = wrf.get_cartopy(t_500)

    # 2. Heavy diagnostic (placeholder example)
    diagnostic = np.log1p(t_500)  # replace with a real computation

    # 3. Plot
    fig = plt.figure(figsize=(10, 8))
    ax = plt.axes(projection=proj)
    ax.coastlines()
    cf = ax.contourf(
        wrf.to_np(lons), wrf.to_np(lats), wrf.to_np(diagnostic),
        transform=ccrs.PlateCarree(),
    )
    plt.colorbar(cf, ax=ax, orientation="horizontal", pad=0.05)

    # 4. Save the image
    out_png = path.replace("wrfout_", "").replace(".nc", "_diagnostic.png")
    plt.savefig(out_png, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_png
Step 2: Dispatch the work with a process pool

if __name__ == "__main__":
    wrfout_files = sorted(glob.glob("/path/to/run/wrfout_d02_*"))
    with ProcessPoolExecutor(max_workers=4) as executor:
        outputs = list(executor.map(process_wrfout_file, wrfout_files))
    print("Finished:", outputs)

Why this works well:

- Each process opens its own NetCDF file; you do not share file handles.

- Large arrays are kept local to the worker process.

- The main process passes only file paths (small, cheap to send).

- CPU-heavy computations and plotting are spread across multiple cores.

------------------------------
Pattern 2: Parallel over time steps within a file

Sometimes a single wrfout file holds many time steps, and each time step
requires an expensive calculation (e.g., full CAPE fields every hour).

You can parallelize over time indices:

from concurrent.futures import ProcessPoolExecutor

from netCDF4 import Dataset
import wrf

def process_time_index(args):
    path, t_index = args
    with Dataset(path) as nc:
        # Get the variable at this time index; the exact indexing
        # depends on how you access the data with WRF/xarray
        var = wrf.getvar(nc, "cape_2d", timeidx=t_index)
    # Heavy calculations or plotting here
    # ...
    return f"{path} time {t_index} done"

if __name__ == "__main__":
    path = "/path/to/wrfout_d02_2025-07-01_00:00:00"
    n_times = 24  # assuming time indices 0..n_times-1
    tasks = [(path, t) for t in range(n_times)]
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_time_index, tasks))
    for r in results:
        print(r)

This pattern is helpful when:

- One large file contains many forecast times.

- You want each process to reuse the same file path but operate on different time indices.
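
Rather than hardcoding n_times = 24 as above, you can read the number of time steps from the file's Time dimension (WRF's unlimited dimension):

from netCDF4 import Dataset

with Dataset(path) as nc:
    n_times = len(nc.dimensions["Time"])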

------------------------------
Pattern 3: Parallel over grid rows or chunks

Some diagnostics are fundamentally local to each column (e.g.,
thermodynamic profiles, parcel lifts, nonlinear solves). In principle, you
can parallelize over grid rows or tiles.

However, there is a trade-off:

- Large NumPy arrays must be serialized and sent to each worker.

- This can consume memory and introduce overhead.

This strategy is best when:

- The grid is modest in size, or

- The per-column work is extremely expensive (so CPU time dominates any overhead).

A simplified example:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def process_row(args):
    # Take a single tuple argument rather than mapping a lambda:
    # lambdas cannot be pickled, so they fail with ProcessPoolExecutor
    j, field_3d = args
    nz, ny, nx = field_3d.shape
    row_result = np.zeros(nx, dtype=float)
    for i in range(nx):
        profile = field_3d[:, j, i]
        # Heavy column-wise computation here
        row_result[i] = np.max(profile)  # placeholder
    return j, row_result

if __name__ == "__main__":
    data_3d = np.random.rand(30, 100, 100)  # example (nz, ny, nx)
    nz, ny, nx = data_3d.shape
    tasks = [(j, data_3d) for j in range(ny)]
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_row, tasks))

    # Reassemble into a 2D field
    out_2d = np.zeros((ny, nx), dtype=float)
    for j, row_vals in results:
        out_2d[j, :] = row_vals
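
One way to reduce the serialization cost (a sketch, with a hypothetical process_chunk worker) is to send each worker only the slab of rows it needs, rather than the full 3D field:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def process_chunk(args):
    # Receives only its own slab of rows, not the whole field
    j_start, sub_field = args  # sub_field has shape (nz, chunk_ny, nx)
    return j_start, sub_field.max(axis=0)  # placeholder column-wise reduction

if __name__ == "__main__":
    data_3d = np.random.rand(30, 100, 100)  # example (nz, ny, nx)
    nz, ny, nx = data_3d.shape
    chunk = 10
    tasks = [(j, data_3d[:, j:j + chunk, :]) for j in range(0, ny, chunk)]

    out_2d = np.zeros((ny, nx), dtype=float)
    with ProcessPoolExecutor(max_workers=4) as executor:
        for j_start, block in executor.map(process_chunk, tasks):
            out_2d[j_start:j_start + block.shape[0], :] = block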

For realistic WRF domains, it is often better to parallelize over files or
time steps first, before considering grid-level parallelism.
------------------------------
Practical guidance and pitfalls

A few best practices for using ProcessPoolExecutor in WRF pipelines:

1. Hard-limit worker processes. Use a constant such as MAX_WORKERS = 4 and pass max_workers=MAX_WORKERS. This keeps CPU usage predictable and avoids oversubscribing cores.

2. Use the if __name__ == "__main__": guard. On Windows (and sometimes in other environments), this is required to prevent child processes from re-importing and re-executing your module indefinitely.

3. Keep worker functions pure and self-contained. Workers should take simple, picklable arguments (paths, indices), avoid relying on global mutable state, and return small results.

4. Pass file paths, not huge arrays. Whenever possible, let each worker process open the WRF output file directly rather than sending large 3D arrays across processes.

5. Be mindful of memory usage. Four workers each holding several large WRF fields can use many gigabytes of RAM (a single float64 field on a 50 x 400 x 400 grid is already about 64 MB). Monitor memory and adjust max_workers accordingly.

6. Handle failures gracefully. Wrap heavy operations inside try/except in the worker function if you want to skip bad files or time steps without killing the whole job (see the sketch after this list).
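
A minimal sketch of that last point, wrapping the Pattern 1 worker:

def safe_process_wrfout_file(path):
    try:
        return process_wrfout_file(path)
    except Exception as exc:
        # Record the failure and keep the rest of the pool running
        return f"FAILED {path}: {exc}"

Map safe_process_wrfout_file instead of the raw worker, then filter out the "FAILED" entries afterward.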

------------------------------
Conclusion

For WRF post-processing and diagnostics, ProcessPoolExecutor is a
straightforward way to:

- Exploit multiple CPU cores,

- Reduce total wall-clock time for heavy calculations,

- Keep your Python code clean and maintainable.

By structuring your workflow around independent units of work—files, times,
or grid chunks—and dispatching them to a small process pool, you can turn
slow, single-core loops into efficient, multi-core pipelines without
rewriting everything in a compiled language.

If you are already doing multi-file or multi-time WRF plotting and
diagnostics in Python, adding process-based parallelism is often one of the
highest-impact performance improvements you can make.