Slow WRF-Hydro calibration runs for small basins on SLURM


Fatemeh Shirkhanloo

May 19, 2026, 5:27:14 PM
to wrf-hydro_users

Dear WRF-Hydro Users,

I hope you are doing well.

I am running WRF-Hydro calibration jobs using Slurm for several sub-basins. The setup works well for the larger basins, but I am experiencing unexpected slowdowns for about 15 smaller basins.

Because these basins have very small land grids, I cannot increase the number of MPI tasks. When I use more CPUs, WRF-Hydro fails with the following error:

Error: number of processes greater than number of cells in the land grid

For these small basins, I am currently using only 2 MPI tasks per job. Initially, I thought the slowdown might be due to multiple small jobs being placed on the same compute node. To test this, I added #SBATCH --exclusive to the generated Slurm scripts. Now each calibration group job is allocated a full node exclusively. For example, Slurm shows:

NumNodes=1 NumCPUs=20 NumTasks=2 CPUs/Task=1
ReqTRES=cpu=2,mem=4000M,node=1,billing=2
AllocTRES=cpu=20,mem=40000M,node=1,billing=20

However, the slowdown still occurs.

The main issue is that the model starts fast, but gradually becomes slower as the simulation advances in model time. For example, for one small basin, the time between monthly restart files increases from about 1–2 minutes early in the simulation to about 6 minutes later in the simulation:

2012–2013: about 1–2 minutes between monthly restart files
2014: about 2.5–3.7 minutes
2015: about 4–4.8 minutes
2016–2017: about 5–6.5 minutes
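
For anyone who wants to reproduce this kind of timing on their own run, the intervals can be estimated from the modification times of the restart files. The following is a minimal Python sketch, not part of WRF-Hydro or the calibration tooling. The function name `restart_intervals` is hypothetical, and the default pattern `HYDRO_RST.*` is an assumption about how the restart files are named; pass a different glob pattern if your files differ.

```python
from pathlib import Path


def restart_intervals(run_dir, pattern="HYDRO_RST.*"):
    """Return (file name, minutes since the previous restart file) pairs.

    Assumption: files matching `pattern` in `run_dir` are the restart
    files, and their modification times reflect when each was written.
    """
    # Order files by the wall-clock time at which they were last written.
    files = sorted(Path(run_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    intervals = []
    # Compare each file with the one written immediately before it.
    for prev, cur in zip(files, files[1:]):
        minutes = (cur.stat().st_mtime - prev.stat().st_mtime) / 60.0
        intervals.append((cur.name, minutes))
    return intervals
```

Calling `restart_intervals("RUN.CALIB/OUTPUT")` (adjust the path to wherever the restart files are written) gives one entry per restart file after the first, which makes a gradual increase easy to see or plot.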

The output directory does not appear to be very large. For one basin, the RUN.CALIB/OUTPUT directory contains about 184 files and is about 48 MB. Most model outputs are disabled, and only limited output is being written.

The forcing directory for this basin contains about 118,000 hourly LDASIN files, with a total size of about 1.9 GB. Each file is small, around 16 KB.
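
These figures are consistent with each other: 118,000 files at roughly 16 KB each is about 1.9 GB. A minimal Python sketch for gathering the same numbers on another basin is shown here. The function name `summarize_forcing` is hypothetical, and it assumes the forcing files sit directly in one directory and have `LDASIN` in their names.

```python
import os


def summarize_forcing(forcing_dir, marker="LDASIN"):
    """Return (file count, total bytes, mean size in KB) for forcing files.

    Assumption: forcing files are regular files directly inside
    `forcing_dir` whose names contain `marker`. Subdirectories are ignored.
    """
    count = 0
    total_bytes = 0
    # os.scandir avoids building a full listing of a very large directory.
    with os.scandir(forcing_dir) as entries:
        for entry in entries:
            if entry.is_file() and marker in entry.name:
                count += 1
                total_bytes += entry.stat().st_size
    mean_kb = total_bytes / count / 1024 if count else 0.0
    return count, total_bytes, mean_kb
```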

I wanted to ask if anyone has seen this behavior before or has suggestions on what might be causing the small-basin calibration runs to slow down over time, even when each job is running on an exclusive node.

Best regards,
Fatemeh 
